### CAP 6640 
### Project 1 - Extractive Summarization
### Feb 8, 2024

### Group 4
### Andres Graterol
###                   UCF ID: 4031393
### Zachary Lyons
###                   UCF ID: 4226832
### Christopher Hinkle
###                   UCF ID: 4038573
### Nicolas Leocadio
###                   UCF ID: 3791733

In [39]:
import string 
import nltk 
import re 
import numpy as np
import networkx as nx

from nltk.tokenize import sent_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from gensim.models import Word2Vec, LsiModel
from scipy import spatial

# Download necessary resources from nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\angel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\angel\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Method 1 - TextRank

#### Step 1 - Data Collection

In [40]:
# Gather lengthy articles or a collection of documents that all relate to the same topic (i.e. documents covering an earthquake)
# TextRank: Single-document summarization

'''
    Input: File path to a text file
    Output: String of the text file
'''
def txt_file_to_string(filepath):
    with open(filepath, 'r', encoding='utf8') as file:
        data = file.read()
    return data

# Data is located in text format, character escaped, inside the Documents folder
# TODO: This is a very short sample document to test functionality. Once we confirm this works, let's use a larger document.
document_filepath = 'Documents/Japanese_Earthquake-NationalGeographic.txt'
document_text = txt_file_to_string(document_filepath)
print(document_text)

On March 11, 2011, Japan experienced the strongest earthquake in its recorded history. The earthquake struck below the North Pacific, 130 kilometers (81 miles) east of Sendai, the largest city in the Tohoku region, a northern part of the island of Honshu. The Tohoku earthquake caused a tsunami. A tsunami—Japanese for “harbor wave”—is a series of powerful waves caused by the displacement of a large body of water. Most tsunamis, like the one that formed off Tohoku, are triggered by underwater tectonic activity, such as earthquakes and volcanic eruptions. The Tohoku tsunami produced waves up to 40 meters (132 feet) high, More than 450,000 people became homeless as a result of the tsunami. More than 15,500 people died. The tsunami also severely crippled the infrastructure of the country.In addition to the thousands of destroyed homes, businesses, roads, and railways, the tsunami caused the meltdown of three nuclear reactors at the Fukushima Daiichi Nuclear Power Plant. The Fukushima nuclea

#### Step 2 - Data Preprocessing

In [41]:
# TextRank: remove punctuation, tokenize, and remove stopwords

'''
    Purpose: Perform appropriate preprocessing on the text file for the TextRank algorithm
'''
def preprocess_text(text, stop_words):
    tokenized_sentences = sent_tokenize(text, language='english')
    print(tokenized_sentences)

    sentences_to_lower = [sentence.lower() for sentence in tokenized_sentences]
    print(sentences_to_lower)

    # Regular Expression to match any punctuation
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    # Remove the punctuation from the lowercase sentences
    sentences_no_punctuation = [regex.sub('', sentence) for sentence in sentences_to_lower]
    print(sentences_no_punctuation)

    # Split on any whitespace (avoids empty-string tokens from repeated spaces) and drop stopwords
    data = [[word for word in sentence.split() if word not in stop_words] for sentence in sentences_no_punctuation]
    return data, tokenized_sentences

# Obtain stopwords from nltk
stop_words = set(stopwords.words('english'))
# Preprocess the text to obtain the data we will use going forward
data, tokenized_sentences = preprocess_text(document_text, stop_words)
print(data)

['On March 11, 2011, Japan experienced the strongest earthquake in its recorded history.', 'The earthquake struck below the North Pacific, 130 kilometers (81 miles) east of Sendai, the largest city in the Tohoku region, a northern part of the island of Honshu.', 'The Tohoku earthquake caused a tsunami.', 'A tsunami—Japanese for “harbor wave”—is a series of powerful waves caused by the displacement of a large body of water.', 'Most tsunamis, like the one that formed off Tohoku, are triggered by underwater tectonic activity, such as earthquakes and volcanic eruptions.', 'The Tohoku tsunami produced waves up to 40 meters (132 feet) high, More than 450,000 people became homeless as a result of the tsunami.', 'More than 15,500 people died.', 'The tsunami also severely crippled the infrastructure of the country.In addition to the thousands of destroyed homes, businesses, roads, and railways, the tsunami caused the meltdown of three nuclear reactors at the Fukushima Daiichi Nuclear Power Plan

#### Step 3 - Feature Engineering

In [45]:
# TextRank: Word Embeddings 
 
# Grab the maximum number of words in a sentence for padding sentence embeddings
max_sentence_length = max([len(sentence) for sentence in data])

'''
    Train the Word2Vec model on the data and calculate embeddings for each word
        min_count: Ignores all words with total frequency lower than this
        vector_size: Dimensionality of the word vectors
'''
# NOTE: If the output is unsatisfactory, train for more epochs
model = Word2Vec(data, min_count=1, vector_size=1)

# Grab sentence embeddings by leveraging the word embeddings and sentence tokens
sentence_embeddings = [[model.wv[word][0] for word in words] for words in data]

# Pad the sentence embeddings with 0's to ensure all sentences have the same length
sentence_embeddings = [np.pad(embedding, (0, max_sentence_length - len(embedding)), 'constant') for embedding in sentence_embeddings]

# Calculate the similarity matrix
# Instantiate an n x n matrix of zeros, where n is the number of sentences
similarity_matrix = np.zeros([len(data), len(data)])

# Populate the similarity matrix with cosine similarity scores (same as 1 - cosine distance)
for i, row in enumerate(sentence_embeddings):
    for j, col in enumerate(sentence_embeddings):
        similarity_matrix[i][j] = 1 - spatial.distance.cosine(row, col)

print(similarity_matrix)


[[ 1.          0.12973852 -0.25991294 -0.18357903 -0.31982276  0.43648082
   0.19730724 -0.27834651 -0.19436854]
 [ 0.12973852  1.         -0.10913887 -0.33759162  0.01915257 -0.30675822
   0.12131891  0.19208398 -0.07006265]
 [-0.25991294 -0.10913887  1.          0.37436241  0.36843267 -0.0707379
  -0.67658389  0.23411885 -0.34417024]
 [-0.18357903 -0.33759162  0.37436241  1.          0.25514823 -0.25728783
  -0.38001704 -0.10238185  0.03989816]
 [-0.31982276  0.01915257  0.36843267  0.25514823  1.         -0.48927537
  -0.11403655  0.26633519  0.11107261]
 [ 0.43648082 -0.30675822 -0.0707379  -0.25728783 -0.48927537  1.
   0.02898987 -0.33324486  0.11546853]
 [ 0.19730724  0.12131891 -0.67658389 -0.38001704 -0.11403655  0.02898987
   1.         -0.27139169  0.47717941]
 [-0.27834651  0.19208398  0.23411885 -0.10238185  0.26633519 -0.33324486
  -0.27139169  1.         -0.25321361]
 [-0.19436854 -0.07006265 -0.34417024  0.03989816  0.11107261  0.11546853
   0.47717941 -0.25321361  1.        ]]
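
The cell above builds each sentence vector by placing one scalar per word in sequence and padding with zeros, so the cosine score compares words by position rather than by meaning. One common alternative, sketched here as an assumption and not as the notebook's method, is to average multi-dimensional word vectors per sentence. The `toy_vectors` dictionary and both helper functions are hypothetical; in the notebook the vectors would come from `model.wv` trained with a `vector_size` larger than 1.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for model.wv lookups
toy_vectors = {
    "earthquake": np.array([0.9, 0.1, 0.0]),
    "tsunami": np.array([0.8, 0.3, 0.1]),
    "reactor": np.array([0.1, 0.9, 0.2]),
    "meltdown": np.array([0.0, 0.8, 0.4]),
}

def mean_sentence_vector(words, vectors, dim):
    """Average the vectors of the known words; all zeros if none are known."""
    known = [vectors[word] for word in words if word in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

def cosine_similarity(a, b):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

toy_sentences = [
    ["earthquake", "tsunami"],
    ["tsunami", "earthquake"],   # same words, different order
    ["reactor", "meltdown"],
]
embeddings = [mean_sentence_vector(s, toy_vectors, 3) for s in toy_sentences]

sim_same_words = cosine_similarity(embeddings[0], embeddings[1])
sim_other_topic = cosine_similarity(embeddings[0], embeddings[2])
print(sim_same_words, sim_other_topic)
```

With mean pooling, word order no longer affects the score, and every sentence vector has the same length without padding.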

#### Step 4 - Algorithm and Results


In [44]:
# TextRank: Call nx's pagerank to get scores. 

''' 
    Get the top n sentences from pagerank scores
'''
def top_n_sentences(n, scores, tokenized_sentences):
    # Key => Sentence 
    # Value => PageRank Score
    sentence_score_dict = {sentence:scores[i] for i, sentence in enumerate(tokenized_sentences)}

    # Filter the dictionary to contain only the top n sentences
    top_sentences = dict(sorted(sentence_score_dict.items(), key=lambda item: item[1], reverse=True)[:n])

    return top_sentences

# Convert similarity matrix to an nx graph and call nx's pagerank
graph = nx.from_numpy_array(similarity_matrix)
scores = nx.pagerank(graph, max_iter=5000)

# NOTE: Modify this variable to change the number of sentences in the summary
num_sent_to_extract = 3

extractive_summary = top_n_sentences(num_sent_to_extract, scores, tokenized_sentences)

# Iterate through the dictionary to output the summary
for sentence, score in extractive_summary.items():
    print(sentence)



  return umr_sum(a, axis, dtype, out, keepdims, initial, where)
  err = np.absolute(x - xlast).sum()


PowerIterationFailedConvergence: (PowerIterationFailedConvergence(...), 'power iteration failed to converge within 5000 iterations')
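
A likely cause of this failure is that the similarity matrix contains negative entries. PageRank treats edge weights as transition strengths and normalizes each row by its sum, so negative weights can produce row sums near zero and an iteration that never settles. The sketch below is a minimal illustration under that assumption: `to_nonnegative_weights` and `pagerank_scores` are hypothetical helpers, and the power iteration is a plain NumPy version written for this example, not the networkx implementation.

```python
import numpy as np

def to_nonnegative_weights(similarity):
    """Clip negative similarities to 0 and remove self-loops."""
    weights = np.clip(np.asarray(similarity, dtype=float), 0.0, None)
    np.fill_diagonal(weights, 0.0)
    return weights

def pagerank_scores(weights, damping=0.85, tol=1e-8, max_iter=1000):
    """Power-iteration PageRank on a non-negative weight matrix."""
    n = weights.shape[0]
    row_sums = weights.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    # Rows with no outgoing weight spread their rank uniformly
    transition = np.where(row_sums > 0, weights / safe_sums, 1.0 / n)
    ranks = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        updated = (1 - damping) / n + damping * ranks @ transition
        if np.abs(updated - ranks).sum() < tol:
            return updated
        ranks = updated
    raise RuntimeError("power iteration did not converge")

# Toy matrix with negative entries, similar in kind to the one printed earlier
toy_similarity = np.array([
    [1.0, 0.4, -0.3],
    [0.4, 1.0, 0.2],
    [-0.3, 0.2, 1.0],
])
toy_weights = to_nonnegative_weights(toy_similarity)
toy_scores = pagerank_scores(toy_weights)
print(toy_scores)
```

In the notebook, the same cleanup could be applied before building the graph, for example `nx.from_numpy_array(to_nonnegative_weights(similarity_matrix))`, and the result passed to `nx.pagerank` as before.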

#### Last Step - Evaluation

In [None]:
# NOTE: Evaluation will depend on the method used to implement extractive summarization
#       - ILP (Integer Linear Programming): We can use ROUGE-2 for evaluation
# Andres NOTE: This is the only section that I am unsure of. It would be cool to use ROUGE-2 to compare our TextRank algorithm to the bigram inspection 
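
ROUGE-2 measures bigram overlap between a candidate summary and a reference summary. The sketch below is a minimal single-reference version, assuming whitespace-tokenized text; the `rouge_2` function and the two example sentences are hypothetical, and the notebook does not yet include a reference summary, so one would have to be written or sourced separately.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))

def rouge_2(candidate_tokens, reference_tokens):
    """Return ROUGE-2 recall, precision, and F1 against a single reference."""
    candidate = bigram_counts(candidate_tokens)
    reference = bigram_counts(reference_tokens)
    # Counter intersection keeps the minimum count of each shared bigram
    overlap = sum((candidate & reference).values())
    recall = overlap / sum(reference.values()) if reference else 0.0
    precision = overlap / sum(candidate.values()) if candidate else 0.0
    total = recall + precision
    f1 = 2 * recall * precision / total if total else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}

reference_summary = "the tsunami caused the meltdown of three reactors".split()
candidate_summary = "the tsunami caused a meltdown of three reactors".split()
rouge_result = rouge_2(candidate_summary, reference_summary)
print(rouge_result)
```

Scoring both methods against the same reference with this function would give a like-for-like comparison.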


### Method 2 - Latent Semantic Indexing (LSI)

#### Step 1 - Data Collection

In [None]:
# Gather lengthy articles or a collection of documents that all relate to the same topic (i.e. documents covering an earthquake)
# LSI (Latent Semantic Indexing): Multi-document summarization

#### Step 2 - Data Preprocessing

In [None]:
# LSI (Latent Semantic Indexing): Tokenize, remove stopwords, and stem the words

#### Step 3 - Feature Engineering

In [None]:
# LSI (Latent Semantic Indexing): Term Frequency

#### Step 4 - Algorithm and Results

In [None]:
# LSI (Latent Semantic Indexing): Create LSI Model using Gensim

# Sort documents by weight 

# Sort vectors by score 

# Select top documents 

# Sort sentence numbers in order 

# Obtain the summary
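
The outline above can be sketched end to end on toy data. This is an illustration under stated assumptions, not the planned Gensim implementation: it swaps in `np.linalg.svd` on a term-by-sentence count matrix in place of `LsiModel`, and it scores each sentence by its length in the singular-value-weighted topic space, which is one common choice among several. The function names and sample sentences are hypothetical.

```python
import numpy as np

def lsi_sentence_scores(token_lists, num_topics=2):
    """Score sentences using a truncated SVD of the term-by-sentence matrix."""
    vocab = sorted({token for tokens in token_lists for token in tokens})
    row_of = {token: row for row, token in enumerate(vocab)}
    # Term frequency matrix: rows are terms, columns are sentences
    counts = np.zeros((len(vocab), len(token_lists)))
    for col, tokens in enumerate(token_lists):
        for token in tokens:
            counts[row_of[token], col] += 1
    _, singular_values, vt = np.linalg.svd(counts, full_matrices=False)
    k = min(num_topics, len(singular_values))
    # Weight each topic row by its singular value, then take column norms
    weighted = singular_values[:k, None] * vt[:k, :]
    return np.sqrt((weighted ** 2).sum(axis=0))

def lsi_summary(sentences, token_lists, n=2, num_topics=2):
    """Pick the n highest-scoring sentences and return them in document order."""
    scores = lsi_sentence_scores(token_lists, num_topics)
    top_indices = sorted(np.argsort(scores)[::-1][:n])
    return [sentences[i] for i in top_indices]

toy_sentences = [
    "Earthquake and tsunami hit Japan.",
    "Tsunami waves hit Japan after the earthquake.",
    "Pasta recipe.",
]
toy_tokens = [
    ["earthquake", "tsunami", "japan"],
    ["tsunami", "waves", "japan", "earthquake"],
    ["recipe", "pasta"],
]
toy_summary = lsi_summary(toy_sentences, toy_tokens, n=2)
print(toy_summary)
```

Sorting the selected indices before output keeps the summary in the original reading order, which matches the "sort sentence numbers in order" step.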

#### Last Step - Evaluation

#### References
##### The following tutorials helped us implement the algorithms in the document:
##### 1. https://medium.com/data-science-in-your-pocket/text-summarization-using-textrank-in-nlp-4bce52c5b390
##### 2. https://towardsdatascience.com/document-summarization-using-latent-semantic-indexing-b747ef2d2af6 