# Text summarization with NLP libraries: nltk and SpaCy (bonus - phi3 model from Microsoft)

## Imports

In [86]:
import spacy
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from string import punctuation
from heapq import nlargest

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [87]:
# Text for processing
# text = "This is an example sentence for tokenization and lemmatization."
text = "The Orbiter Discovery, OV-103, is considered eligible for listing in the National Register of Historic Places (NRHP) in the context of the U.S. Space Shuttle Program (1969-2011) under Criterion A in the areas of Space Exploration and Transportation and under Criterion C in the area of Engineering. Because it has achieved significance within the past fifty years, Criteria Consideration G applies. Under Criterion A, Discovery is significant as the oldest of the three extant orbiter vehicles constructed for the Space Shuttle Program (SSP), the longest running American space program to date; she was the third of five orbiters built by NASA. Unlike the Mercury, Gemini, and Apollo programs, the SSP’s emphasis was on cost effectiveness and reusability, and eventually the construction of a space station. Including her maiden voyage (launched August 30, 1984), Discovery flew to space thirty-nine times, more than any of the other four orbiters; she was also the first orbiter to fly twenty missions. She had the honor of being chosen as the Return to Flight vehicle after both the Challenger and Columbia accidents. Discovery was the first shuttle to fly with the redesigned SRBs, a result of the Challenger accident, and the first shuttle to fly with the Phase II and Block I SSME. Discovery also carried the Hubble Space Telescope to orbit and performed two of the five servicing missions to the observatory. She flew the first and last dedicated Department of Defense (DoD) missions, as well as the first unclassified defense-related mission. In addition, Discovery was vital to the construction of the International Space Station (ISS); she flew thirteen of the thirty-seven total missions flown to the station by a U.S. Space Shuttle. She was the first orbiter to dock to the ISS, and the first to perform an exchange of a resident crew. Under Criterion C, Discovery is significant as a feat of engineering. According to Wayne Hale, a flight director from Johnson Space Center, the Space Shuttle orbiter represents a “huge technological leap from expendable rockets and capsules to a reusable, winged, hypersonic, cargo-carrying spacecraft.” Although her base structure followed a conventional aircraft design, she used advanced materials that both minimized her weight for cargo-carrying purposes and featured low thermal expansion ratios, which provided a stable base for her Thermal Protection System (TPS) materials. The Space Shuttle orbiter also featured the first reusable TPS; all previous spaceflight vehicles had a single-use, ablative heat shield. Other notable engineering achievements of the orbiter included the first reusable orbital propulsion system, and the first two-fault-tolerant Integrated Avionics System. As Hale stated, the Space Shuttle remains “the largest, fastest, winged hypersonic aircraft in history,” having regularly flown at twenty-five times the speed of sound."

print(f'Original text:\nChars: {len(text)}')

# % sentences
percent_centenses = 0.4

Original text:
Chars: 2906


## NLTK Summarization

In [88]:
# NLTK summarization
stop_words = set(stopwords.words('english'))

# Tokenization
sentences = sent_tokenize(text)

word_frequencies = {}
for word in word_tokenize(text):
    if word.lower() not in stop_words and word.lower() not in punctuation:
        if word not in word_frequencies:
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1

maximum_frequency = max(word_frequencies.values())

# Normalize word frequencies
for word in word_frequencies:
    word_frequencies[word] = word_frequencies[word] / maximum_frequency

# Compute sentence scores
sentence_scores = {}
for sentence in sentences:
    for word in word_tokenize(sentence.lower()):
        if word in word_frequencies:
            if len(sentence.split(' ')) < 30:
                if sentence not in sentence_scores:
                    sentence_scores[sentence] = word_frequencies[word]
                else:
                    sentence_scores[sentence] += word_frequencies[word]

# % sentences for summarization
select_length = int(len(sentences) * percent_centenses)
nltk_summary_sentences = nlargest(select_length, sentence_scores, key=sentence_scores.get)
nltk_summary = ' '.join(nltk_summary_sentences)
print(f'NLTK summary:\nSentences: {len(nltk_summary_sentences)} Chars: {len(nltk_summary)}\n{nltk_summary}')

NLTK summary:
Sentences: 6 Chars: 842
Other notable engineering achievements of the orbiter included the first reusable orbital propulsion system, and the first two-fault-tolerant Integrated Avionics System. The Space Shuttle orbiter also featured the first reusable TPS; all previous spaceflight vehicles had a single-use, ablative heat shield. She flew the first and last dedicated Department of Defense (DoD) missions, as well as the first unclassified defense-related mission. She was the first orbiter to dock to the ISS, and the first to perform an exchange of a resident crew. As Hale stated, the Space Shuttle remains “the largest, fastest, winged hypersonic aircraft in history,” having regularly flown at twenty-five times the speed of sound. Discovery also carried the Hubble Space Telescope to orbit and performed two of the five servicing missions to the observatory.


## SpaCy Summarization

In [89]:
# SpaCy summarization
nlp = spacy.load('en_core_web_sm')

doc = nlp(text)

# Extract sentences
sentences = [sent.text for sent in doc.sents]

punctuation = punctuation + '\n'

# Tokenize words and remove stop words and punctuation
word_frequencies = {}
for word in doc:
    if word.text.lower() not in stop_words and word.text.lower() not in punctuation:
        if word.text not in word_frequencies:
            word_frequencies[word.text] = 1
        else:
            word_frequencies[word.text] += 1

maximum_frequency = max(word_frequencies.values())

# Normalize word frequencies
for word in word_frequencies:
    word_frequencies[word] = word_frequencies[word] / maximum_frequency

# Compute sentence scores
sentence_scores = {}
for sentence in sentences:
    for word in nlp(sentence.lower()):
        if word.text in word_frequencies:
            if len(sentence.split(' ')) < 30:
                if sentence not in sentence_scores:
                    sentence_scores[sentence] = word_frequencies[word.text]
                else:
                    sentence_scores[sentence] += word_frequencies[word.text]

# % sentences for summarization
select_length = int(len(sentences) * percent_centenses)
spacy_summary_sentences = nlargest(select_length, sentence_scores, key=sentence_scores.get)
spacy_summary = ' '.join(spacy_summary_sentences)
print(f'SpaCy summary:\nSentences: {len(spacy_summary_sentences)} Chars: {len(spacy_summary)}\n{spacy_summary}')

SpaCy summary:
Sentences: 6 Chars: 842
Other notable engineering achievements of the orbiter included the first reusable orbital propulsion system, and the first two-fault-tolerant Integrated Avionics System. The Space Shuttle orbiter also featured the first reusable TPS; all previous spaceflight vehicles had a single-use, ablative heat shield. She flew the first and last dedicated Department of Defense (DoD) missions, as well as the first unclassified defense-related mission. As Hale stated, the Space Shuttle remains “the largest, fastest, winged hypersonic aircraft in history,” having regularly flown at twenty-five times the speed of sound. She was the first orbiter to dock to the ISS, and the first to perform an exchange of a resident crew. Discovery also carried the Hubble Space Telescope to orbit and performed two of the five servicing missions to the observatory.


## Local phi3:latest model installed via ollama Summarization
* install ollama from https://ollama.com/ for you OS
* run in the console `ollama run phi3:latest`
* run cell below

In [2]:
#ollama run phi3:latest
import requests
import json


OLLAMA_ENDPOINT = "http://localhost:11434/api/generate"
system_prompt = 'Your goal is to summarize the text given to you in roughly less than 900 chars. Only output the summary without any additional text. Focus on providing a summary in freeform text with what people said and the action items coming out of it. Text: '
text = "The Orbiter Discovery, OV-103, is considered eligible for listing in the National Register of Historic Places (NRHP) in the context of the U.S. Space Shuttle Program (1969-2011) under Criterion A in the areas of Space Exploration and Transportation and under Criterion C in the area of Engineering. Because it has achieved significance within the past fifty years, Criteria Consideration G applies. Under Criterion A, Discovery is significant as the oldest of the three extant orbiter vehicles constructed for the Space Shuttle Program (SSP), the longest running American space program to date; she was the third of five orbiters built by NASA. Unlike the Mercury, Gemini, and Apollo programs, the SSP’s emphasis was on cost effectiveness and reusability, and eventually the construction of a space station. Including her maiden voyage (launched August 30, 1984), Discovery flew to space thirty-nine times, more than any of the other four orbiters; she was also the first orbiter to fly twenty missions. She had the honor of being chosen as the Return to Flight vehicle after both the Challenger and Columbia accidents. Discovery was the first shuttle to fly with the redesigned SRBs, a result of the Challenger accident, and the first shuttle to fly with the Phase II and Block I SSME. Discovery also carried the Hubble Space Telescope to orbit and performed two of the five servicing missions to the observatory. She flew the first and last dedicated Department of Defense (DoD) missions, as well as the first unclassified defense-related mission. In addition, Discovery was vital to the construction of the International Space Station (ISS); she flew thirteen of the thirty-seven total missions flown to the station by a U.S. Space Shuttle. She was the first orbiter to dock to the ISS, and the first to perform an exchange of a resident crew. Under Criterion C, Discovery is significant as a feat of engineering. According to Wayne Hale, a flight director from Johnson Space Center, the Space Shuttle orbiter represents a “huge technological leap from expendable rockets and capsules to a reusable, winged, hypersonic, cargo-carrying spacecraft.” Although her base structure followed a conventional aircraft design, she used advanced materials that both minimized her weight for cargo-carrying purposes and featured low thermal expansion ratios, which provided a stable base for her Thermal Protection System (TPS) materials. The Space Shuttle orbiter also featured the first reusable TPS; all previous spaceflight vehicles had a single-use, ablative heat shield. Other notable engineering achievements of the orbiter included the first reusable orbital propulsion system, and the first two-fault-tolerant Integrated Avionics System. As Hale stated, the Space Shuttle remains “the largest, fastest, winged hypersonic aircraft in history,” having regularly flown at twenty-five times the speed of sound."


OLLAMA_PROMPT = f"{system_prompt}: {text}"
OLLAMA_DATA = {
     "model": "phi3:latest",
     "prompt": OLLAMA_PROMPT,
     "stream": False,
     "keep_alive": "1m",
  }

response = requests.post(OLLAMA_ENDPOINT, json=OLLAMA_DATA)
ollama_summary = response.json()["response"]

print(f'Phi3 summary:\nChars: {len(ollama_summary)}\n{ollama_summary}')


Phi3 summary:
Chars: 726
The Orbiter Discovery (OV-103) is eligible for listing on the National Register of Historic Places due to its significance within the U.S. Space Shuttle Program, highlighting it as the oldest orbiter constructed and achieving notable milestones such as first flight in 1984, twenty missions, Hubble servicing, ISS construction contributions, and engineering innovations like reusable TPS and two-fault tolerant avionics system. It was also involved after Challenger/Columbia disasters, with the redesigned SRBs and SSME enhancements. Discovery flew thirty-nine times, surpassing other orbiters, making it a valuable candidate for historic preservation under Criteria A (Space Exploration & Transportation) and C (Engineering).


## Text Summarization Project Summary

This project compared three different approaches to text summarization: NLTK, SpaCy, and phi3 (accesed via Ollama model from Microsoft). The goal was to summarize a given text snippet about the Space Shuttle Orbiter Discovery.

**Text:**

The passage  discussed the Space Shuttle Orbiter Discovery (OV-103) and its eligibility for listing on the National Register of Historic Places. It highlighted Discovery's role in the U.S. Space Shuttle Program, including its construction, missions flown, and engineering advancements. 

**Desired Summary Length:**

Approximately 6 sentences and under 900 characters.

**Comparison of Results:**

| Summarization Technique | Summary | Character Count |
|---|---|---|
| NLTK | Focused on the engineering achievements of Discovery, including the reusable TPS and first two-fault-tolerant avionics system. Also mentioned Discovery's role in DoD missions and the ISS. (6 sentences, 842 characters) | 842 |
| SpaCy | Identical to NLTK summary. (6 sentences, 842 characters) | 842 |
| phi3 | Provided a broader summary including Discovery's construction, notable milestones (first flight, missions, Hubble servicing, ISS contributions), engineering innovations, and involvement after the Challenger and Columbia accidents. (**726 characters**) | 726 |

**Observations:**

* NLTK and SpaCy produced identical summaries, focusing on the engineering aspects and involvement with the DoD and ISS.
* phi3 generated a more concise summary that captured the key points of Discovery's significance, including its historical role, achievements, and engineering advancements.

**Overall:**

phi3 achieved the desired summary length while providing a more comprehensive overview of the text compared to NLTK and SpaCy. While NLTK and SpaCy are good options for extractive summarization based on word frequencies, phi3 demonstrates the capabilities of large language models for abstractive summarization that captures the overall meaning and key points of the text, but requires additional setup.

