# Text Summary Using NLP

## Setup

### Import Libraries

In [761]:
import spacy  # pretrained pipelines, tokenization & training for 70+ languages
import nltk  # Natural Language Toolkit
from heapq import nlargest  # find largest elements from an iterable object
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from string import punctuation

nltk.download("stopwords")
nltk.download("punkt")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\danie\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\danie\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Language Model for English
The English model must be downloaded once before the cell below can load it:
--> python -m spacy download en_core_web_sm

In [762]:
nlp = spacy.load("en_core_web_sm")

### Input Text

In [763]:
with open("text.txt", "r", encoding="utf-8") as fd:
    text = fd.read()

In [764]:
print("Length in characters: ", len(text))
print(text)
chars = sorted(set(text))
print(f"{len(chars)} unique characters:")
print("".join(chars))

Length in characters:  2907
The Orbiter Discovery, OV-103, is considered eligible for listing in the National Register of Historic Places (NRHP) in the context of the U.S. Space Shuttle Program (1969-2011) under Criterion A in the areas of Space Exploration and Transportation and under Criterion C in the area of Engineering. Because it has achieved significance within the past fifty years, Criteria Consideration G applies. Under Criterion A, Discovery is significant as the oldest of the three extant orbiter vehicles constructed for the Space Shuttle Program (SSP), the longest running American space program to date; she was the third of five orbiters built by NASA. Unlike the Mercury, Gemini, and Apollo programs, the SSP’s emphasis was on cost effectiveness and reusability, and eventually the construction of a space station. Including her maiden voyage (launched August 30, 1984), Discovery flew to space thirty-nine times, more than any of the other four orbiters; she was also the first 

### Text and Words Tokenization

In [765]:
doc = nlp(text)
tokens_1 = [token.text for token in doc]
print(tokens_1)
print(f"In total {len(tokens_1)} tokens with punctuation and stopwords")
print(f"{len(set(tokens_1))} unique tokens")

['The', 'Orbiter', 'Discovery', ',', 'OV-103', ',', 'is', 'considered', 'eligible', 'for', 'listing', 'in', 'the', 'National', 'Register', 'of', 'Historic', 'Places', '(', 'NRHP', ')', 'in', 'the', 'context', 'of', 'the', 'U.S.', 'Space', 'Shuttle', 'Program', '(', '1969', '-', '2011', ')', 'under', 'Criterion', 'A', 'in', 'the', 'areas', 'of', 'Space', 'Exploration', 'and', 'Transportation', 'and', 'under', 'Criterion', 'C', 'in', 'the', 'area', 'of', 'Engineering', '.', 'Because', 'it', 'has', 'achieved', 'significance', 'within', 'the', 'past', 'fifty', 'years', ',', 'Criteria', 'Consideration', 'G', 'applies', '.', 'Under', 'Criterion', 'A', ',', 'Discovery', 'is', 'significant', 'as', 'the', 'oldest', 'of', 'the', 'three', 'extant', 'orbiter', 'vehicles', 'constructed', 'for', 'the', 'Space', 'Shuttle', 'Program', '(', 'SSP', ')', ',', 'the', 'longest', 'running', 'American', 'space', 'program', 'to', 'date', ';', 'she', 'was', 'the', 'third', 'of', 'five', 'orbiters', 'built', 'b

In [766]:
tokens_2 = word_tokenize(text)
print(tokens_2)
print(f"In total {len(tokens_2)} tokens with punctuation and stopwords")
print(f"{len(set(tokens_2))} unique tokens")

['The', 'Orbiter', 'Discovery', ',', 'OV-103', ',', 'is', 'considered', 'eligible', 'for', 'listing', 'in', 'the', 'National', 'Register', 'of', 'Historic', 'Places', '(', 'NRHP', ')', 'in', 'the', 'context', 'of', 'the', 'U.S.', 'Space', 'Shuttle', 'Program', '(', '1969-2011', ')', 'under', 'Criterion', 'A', 'in', 'the', 'areas', 'of', 'Space', 'Exploration', 'and', 'Transportation', 'and', 'under', 'Criterion', 'C', 'in', 'the', 'area', 'of', 'Engineering', '.', 'Because', 'it', 'has', 'achieved', 'significance', 'within', 'the', 'past', 'fifty', 'years', ',', 'Criteria', 'Consideration', 'G', 'applies', '.', 'Under', 'Criterion', 'A', ',', 'Discovery', 'is', 'significant', 'as', 'the', 'oldest', 'of', 'the', 'three', 'extant', 'orbiter', 'vehicles', 'constructed', 'for', 'the', 'Space', 'Shuttle', 'Program', '(', 'SSP', ')', ',', 'the', 'longest', 'running', 'American', 'space', 'program', 'to', 'date', ';', 'she', 'was', 'the', 'third', 'of', 'five', 'orbiters', 'built', 'by', 'NAS

The second method (NLTK's word_tokenize) yields fewer tokens because it keeps hyphenated expressions together ("1969-2011"), whereas spaCy splits them into separate tokens ("1969", "-", "2011"). Neither method removes punctuation or stopwords.
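The effect of hyphen handling on the token count can be reproduced without either library. The sketch below uses two hypothetical regex tokenizers (they are not the actual spaCy or NLTK rules) on a fragment of the input text:

```python
import re


def tokenize_keep_hyphens(text):
    """Hypothetical tokenizer: keeps hyphenated spans such as '1969-2011' whole."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)


def tokenize_split_hyphens(text):
    """Hypothetical tokenizer: emits every hyphen as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


sample = "Space Shuttle Program (1969-2011)"
kept = tokenize_keep_hyphens(sample)
split = tokenize_split_hyphens(sample)
print(kept)
print(split)
print(f"{len(kept)} tokens vs {len(split)} tokens")
```

Each hyphenated expression that is split adds two tokens (the hyphen and the second part), which is why the two totals drift apart on longer texts.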

In [767]:
sentence_tokens = sent_tokenize(text)
print(sentence_tokens)
print(f"In total {len(sentence_tokens)} sentences")

['The Orbiter Discovery, OV-103, is considered eligible for listing in the National Register of Historic Places (NRHP) in the context of the U.S. Space Shuttle Program (1969-2011) under Criterion A in the areas of Space Exploration and Transportation and under Criterion C in the area of Engineering.', 'Because it has achieved significance within the past fifty years, Criteria Consideration G applies.', 'Under Criterion A, Discovery is significant as the oldest of the three extant orbiter vehicles constructed for the Space Shuttle Program (SSP), the longest running American space program to date; she was the third of five orbiters built by NASA.', 'Unlike the Mercury, Gemini, and Apollo programs, the SSP’s emphasis was on cost effectiveness and reusability, and eventually the construction of a space station.', 'Including her maiden voyage (launched August 30, 1984), Discovery flew to space thirty-nine times, more than any of the other four orbiters; she was also the first orbiter to fly

### Stopwords
Stopwords are very common words (for example "the", "is", "in") that carry little information on their own.

In [768]:
stop_words = set(stopwords.words("english"))
print(stop_words)
print(f"In total {len(stop_words)} stopwords")

{'being', 'after', 'when', 'in', 'needn', 'shouldn', 'for', 'there', 'does', 'her', 'hadn', 'theirs', "didn't", 'ours', 'had', 'again', 'to', 'now', 'ain', 'do', 'it', 'was', 'your', "doesn't", "it's", 'below', "she's", 'having', 'these', 'wouldn', 'only', 'couldn', 'same', 'once', 'mightn', 'aren', 'against', 'what', 'not', 'any', 'myself', 's', 'off', 'further', 'over', 'few', 'll', 'very', 'just', 'isn', 'where', 'can', 'd', 'those', 'hers', 'this', 'how', "you'd", 'above', 'weren', "mustn't", "shouldn't", "wasn't", 'down', 'as', 'our', 'itself', "you'll", 'about', 'yourselves', 'y', 'which', 'or', 'that', 'is', 'both', "hasn't", 'i', 'some', 'them', 'before', 'most', 'has', 'were', 'such', 've', 'ma', 'an', "hadn't", 'into', 'yourself', 'while', 'herself', 'hasn', 'me', 'because', "wouldn't", 'won', 'between', "don't", 'my', 'with', 'more', 'himself', 'doesn', 'am', 'themselves', 'during', 'up', "you've", 't', 'doing', 'ourselves', 'did', "isn't", 'no', 'o', "weren't", 'too', 'its'

In [769]:
punctuation = punctuation + "\n" + "“" + "”" + "’"

### Token Frequencies

In [770]:
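def get_tokens_frequencies(document):
    """Count lowercased tokens, skipping stopwords, punctuation and stray fragments."""
    word_frequencies = {}
    for token in document:
        word = token.text.lower()
        # punctuation is a string, so this is a substring test
        if word in stop_words or word in punctuation:
            continue
        # Leftovers of the possessive "’s" and of "Criterion C" / "Consideration G"
        if word in ("’s", "c", "g"):
            continue
        word_frequencies[word] = word_frequencies.get(word, 0) + 1
    return word_frequencies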
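The counting loop above can also be expressed with collections.Counter. The following is only a sketch on plain strings rather than spaCy tokens; demo_stop_words, demo_punctuation and count_tokens are hypothetical stand-ins, not names from this notebook:

```python
import string
from collections import Counter

# Hypothetical stand-ins for the notebook's stop_words and punctuation.
demo_stop_words = {"the", "is", "of", "a"}
demo_punctuation = set(string.punctuation)


def count_tokens(words):
    """Count lowercased words, skipping stopwords and single punctuation marks."""
    lowered = (w.lower() for w in words)
    return Counter(
        w for w in lowered
        if w not in demo_stop_words and w not in demo_punctuation
    )


demo_words = ["The", "Space", "Shuttle", "is", "a", "space", "vehicle", "."]
demo_frequencies = count_tokens(demo_words)
print(demo_frequencies.most_common(2))
```

Counter also provides most_common, which is handy for inspecting the highest-frequency tokens before scoring sentences.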

In [771]:
word_frequencies = get_tokens_frequencies(doc)
print(word_frequencies)
print(f"In total {len(word_frequencies)} tokens")

{'orbiter': 7, 'discovery': 7, 'ov-103': 1, 'considered': 1, 'eligible': 1, 'listing': 1, 'national': 1, 'register': 1, 'historic': 1, 'places': 1, 'nrhp': 1, 'context': 1, 'u.s.': 2, 'space': 13, 'shuttle': 8, 'program': 3, '1969': 1, '2011': 1, 'criterion': 4, 'areas': 1, 'exploration': 1, 'transportation': 1, 'area': 1, 'engineering': 3, 'achieved': 1, 'significance': 1, 'within': 1, 'past': 1, 'fifty': 1, 'years': 1, 'criteria': 1, 'consideration': 1, 'applies': 1, 'significant': 2, 'oldest': 1, 'three': 1, 'extant': 1, 'vehicles': 2, 'constructed': 1, 'ssp': 2, 'longest': 1, 'running': 1, 'american': 1, 'date': 1, 'third': 1, 'five': 3, 'orbiters': 2, 'built': 1, 'nasa': 1, 'unlike': 1, 'mercury': 1, 'gemini': 1, 'apollo': 1, 'programs': 1, 'emphasis': 1, 'cost': 1, 'effectiveness': 1, 'reusability': 1, 'eventually': 1, 'construction': 2, 'station': 3, 'including': 1, 'maiden': 1, 'voyage': 1, 'launched': 1, 'august': 1, '30': 1, '1984': 1, 'flew': 3, 'thirty': 2, 'nine': 1, 'time

## Text Summary

### Scoring Sentences to Find the Most Important Ones for the Summary

In [772]:
sentence_scores = {}
for sent in sentence_tokens:
    for word in word_tokenize(sent.lower()):
        # Add the frequency of every known word to the sentence's running score
        if word in word_frequencies:
            sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[word]

print(sentence_scores)

{'The Orbiter Discovery, OV-103, is considered eligible for listing in the National Register of Historic Places (NRHP) in the context of the U.S. Space Shuttle Program (1969-2011) under Criterion A in the areas of Space Exploration and Transportation and under Criterion C in the area of Engineering.': 78, 'Because it has achieved significance within the past fifty years, Criteria Consideration G applies.': 9, 'Under Criterion A, Discovery is significant as the oldest of the three extant orbiter vehicles constructed for the Space Shuttle Program (SSP), the longest running American space program to date; she was the third of five orbiters built by NASA.': 80, 'Unlike the Mercury, Gemini, and Apollo programs, the SSP’s emphasis was on cost effectiveness and reusability, and eventually the construction of a space station.': 30, 'Including her maiden voyage (launched August 30, 1984), Discovery flew to space thirty-nine times, more than any of the other four orbiters; she was also the first
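To make the scoring rule concrete: a sentence's score is the sum of the frequencies of all its words that appear in the frequency table, counting repeated words each time. The sketch below uses a made-up frequency table and made-up sentences, and plain whitespace splitting in place of word_tokenize:

```python
# Made-up frequency table and sentences, for illustration only.
toy_frequencies = {"space": 3, "shuttle": 2, "orbiter": 1}
toy_sentences = [
    "the space shuttle flew to space",
    "the orbiter landed",
    "it was a sunny day",
]

toy_scores = {}
for sentence in toy_sentences:
    # Whitespace splitting stands in for word_tokenize here.
    score = sum(toy_frequencies.get(word, 0) for word in sentence.split())
    if score > 0:  # as in the loop above, sentences without known words get no entry
        toy_scores[sentence] = score

print(toy_scores)
```

Because the score is a plain sum, longer sentences tend to rank higher. In the output above, the long first and third sentences score 78 and 80, while the short second sentence scores 9.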

### Full Summary

In [773]:
select_length = len(sentence_tokens)  # keep every sentence, ranked by score
summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)
summary = " ".join(summary)
print(summary)

According to Wayne Hale, a flight director from Johnson Space Center, the Space Shuttle orbiter represents a “huge technological leap from expendable rockets and capsules to a reusable, winged, hypersonic, cargo-carrying spacecraft.” Although her base structure followed a conventional aircraft design, she used advanced materials that both minimized her weight for cargo-carrying purposes and featured low thermal expansion ratios, which provided a stable base for her Thermal Protection System (TPS) materials. Under Criterion A, Discovery is significant as the oldest of the three extant orbiter vehicles constructed for the Space Shuttle Program (SSP), the longest running American space program to date; she was the third of five orbiters built by NASA. The Orbiter Discovery, OV-103, is considered eligible for listing in the National Register of Historic Places (NRHP) in the context of the U.S. Space Shuttle Program (1969-2011) under Criterion A in the areas of Space Exploration and Transpo

### Short Summary

In [774]:
select_length = int(len(sentence_tokens) * 0.25)
summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)
summary = " ".join(summary)
print(summary)

According to Wayne Hale, a flight director from Johnson Space Center, the Space Shuttle orbiter represents a “huge technological leap from expendable rockets and capsules to a reusable, winged, hypersonic, cargo-carrying spacecraft.” Although her base structure followed a conventional aircraft design, she used advanced materials that both minimized her weight for cargo-carrying purposes and featured low thermal expansion ratios, which provided a stable base for her Thermal Protection System (TPS) materials. Under Criterion A, Discovery is significant as the oldest of the three extant orbiter vehicles constructed for the Space Shuttle Program (SSP), the longest running American space program to date; she was the third of five orbiters built by NASA. The Orbiter Discovery, OV-103, is considered eligible for listing in the National Register of Historic Places (NRHP) in the context of the U.S. Space Shuttle Program (1969-2011) under Criterion A in the areas of Space Exploration and Transpo
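Note that nlargest returns the selected sentences ordered by score, highest first, so the summary does not follow the order of the source text: the output above opens with the Wayne Hale quotation, while the first sentence of the text appears later. If document order is preferred, the selection could be re-sorted afterwards. A minimal sketch with made-up sentences and scores (the re-sorting step is an optional addition, not part of this notebook):

```python
from heapq import nlargest

# Made-up sentences and scores, for illustration only.
ordered_sentences = [
    "First sentence.",
    "Second sentence.",
    "Third sentence.",
    "Fourth sentence.",
]
demo_scores = {
    "First sentence.": 5,
    "Second sentence.": 1,
    "Third sentence.": 9,
    "Fourth sentence.": 3,
}

# Iterating a dict yields its keys, so nlargest ranks the sentences by score.
top_by_score = nlargest(2, demo_scores, key=demo_scores.get)

# Optional step: put the selected sentences back in their original order.
top_in_order = sorted(top_by_score, key=ordered_sentences.index)

print(top_by_score)
print(top_in_order)
```

Using list.index as the sort key is fine for a handful of sentences. For long documents, a precomputed position dictionary would avoid the repeated linear scans.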

In [775]:
with open("summary.txt", "w", encoding="utf-8") as fd:
    fd.write(summary)