### CAP 6640 
### Project 1 - Extractive Summarization
### Feb 8, 2024

### Group 4
### Andres Graterol
###                   UCF ID: 4031393
### Zachary Lyons
###                   UCF ID: 4226832
### Christopher Hinkle
###                   UCF ID: 4038573
### Nicolas Leocadio
###                   UCF ID: 3791733

In [1]:
import string 
import nltk 
import re 
import numpy as np
import networkx as nx
import csv

from rouge_score import rouge_scorer
from nltk.tokenize import sent_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from gensim.models import Word2Vec, LsiModel
from gensim import corpora
from scipy import spatial

# Download necessary resources from nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nick_\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nick_\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Method 1 - TextRank

#### Step 1 - Data Collection

In [317]:
# Gather lengthy articles or a collection of documents that all relate to the same topic (e.g. documents covering an earthquake)
# TextRank: Single-document summarization

'''
    Input: File path to a text file
    Output: String of the text file
'''
def txt_file_to_string(filepath):
    with open(filepath, 'r', encoding='utf8') as file:
        data = file.read()
        data = data.replace('\n', ' ') # Remove newline characters
    return data

# Data is located in text format, character escaped, inside the Documents folder
# TODO: This is a very short sample document to test functionality. Once we confirm this works, let's use a larger document.
document_filepath = 'Documents/Japanese_Earthquake-NationalGeographic.txt'
document_text = txt_file_to_string(document_filepath)
print(document_text)

#### Step 2 - Data Preprocessing

In [318]:
# TextRank: remove punctuation, tokenize, and remove stopwords

'''
    Purpose: Perform appropriate preprocessing on the text file for the TextRank algorithm
'''
def preprocess_text(text, stop_words):
    tokenized_sentences = sent_tokenize(text, language='english')

    sentences_to_lower = [sentence.lower() for sentence in tokenized_sentences]

    # Regular Expression to match any punctuation
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    # Remove the punctuation from the lowercase sentences
    sentences_no_punctuation = [regex.sub('', sentence) for sentence in sentences_to_lower]

    # Split on any whitespace so repeated spaces do not produce empty-string tokens, then drop stopwords
    data = [[word for word in sentence.split() if word not in stop_words] for sentence in sentences_no_punctuation]
    return data, tokenized_sentences

# Obtain stopwords from nltk
stop_words = set(stopwords.words('english'))
# Preprocess the text to obtain the data we will use going forward
data, tokenized_sentences = preprocess_text(document_text, stop_words)
print(data)

#### Step 3 - Feature Engineering

In [319]:
# TextRank: Word Embeddings 
 
# Grab the maximum number of words in a sentence for padding sentence embeddings
max_sentence_length = max([len(sentence) for sentence in data])

'''
    Train the Word2Vec model on the data and calculate embeddings for each word
        min_count: Ignores all words with total frequency lower than this
        vector_size: Dimensionality of the word vectors
'''
# NOTE: If the output is unsatisfactory, train for more epochs
model = Word2Vec(data, min_count=1, vector_size=1, epochs=5000)

# Grab sentence embeddings by leveraging the word embeddings and sentence tokens
sentence_embeddings = [[model.wv[word][0] for word in words] for words in data]

# Pad the sentence embeddings with 0's to ensure all sentences have the same length
sentence_embeddings = [np.pad(embedding, (0, max_sentence_length - len(embedding)), 'constant') for embedding in sentence_embeddings]

# Calculate the similarity matrix
# Instantiate a matrix of zeros with the same shape as the number of sentences
similarity_matrix = np.zeros([len(data), len(data)])

# Populate the similarity matrix with cosine similarity scores (same as 1 - cosine distance)
for i, row in enumerate(sentence_embeddings):
    for j, col in enumerate(sentence_embeddings):
        similarity_matrix[i][j] = 1 - spatial.distance.cosine(row, col)

print(similarity_matrix)
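
To make the padding and cosine steps above concrete, here is a minimal pure-Python sketch on toy data. The helpers `pad` and `cosine_similarity` and the toy numbers are illustrative assumptions, not part of the notebook. The sketch also defines the similarity with an all-zero vector as 0.0, a choice the notebook itself does not make.

```python
import math

def pad(vector, length):
    """Right-pad a list of floats with zeros up to `length`."""
    return list(vector) + [0.0] * (length - len(vector))

def cosine_similarity(u, v):
    """Cosine similarity; defined here as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy "sentences": one scalar per word, mirroring vector_size=1 (values are made up)
toy_embeddings = [[0.5, -0.2, 0.1], [0.5, -0.2], [-0.3]]
max_len = max(len(e) for e in toy_embeddings)
padded = [pad(e, max_len) for e in toy_embeddings]

# Pairwise similarity matrix, analogous to similarity_matrix above
toy_similarity = [[cosine_similarity(a, b) for b in padded] for a in padded]
for row in toy_similarity:
    print([round(x, 3) for x in row])
```

Because each position in a padded vector corresponds to a word position, two sentences only score as similar when similar values line up at the same positions, so word order affects the result.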


#### Step 4 - Algorithm and Results


In [478]:
# TextRank: Call nx's pagerank to get scores. 

''' 
    Get the top n sentences from pagerank scores
'''
def top_n_sentences(n, scores, tokenized_sentences):
    # Key => Sentence 
    # Value => PageRank Score
    sentence_score_dict = {sentence:scores[i] for i, sentence in enumerate(tokenized_sentences)}

    # Filter the dictionary to contain only the top n sentences
    top_sentences = dict(sorted(sentence_score_dict.items(), key=lambda item: item[1], reverse=True)[:n])

    return top_sentences

# Convert similarity matrix to an nx graph and call nx's pagerank
graph = nx.from_numpy_array(similarity_matrix)
scores = nx.pagerank(graph)

# NOTE: Modify this variable to change the number of sentences in the summary
num_sent_to_extract = 2

extractive_summary = top_n_sentences(num_sent_to_extract, scores, tokenized_sentences)

# Iterate through the dictionary to output the summary
for sentence, score in extractive_summary.items():
    print(sentence)



The tsunami raced outward from the epicentre at speeds that approached about 500 miles (800 km) per hour.
In addition to Sendai, other communities hard-hit by the tsunami included Kamaishi and Miyako in Iwate; Ishinomaki, Kesennuma, and Shiogama in Miyagi; and Kitaibaraki and Hitachinaka in Ibaraki.
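
For intuition about what `nx.pagerank` does with the similarity graph, the sketch below runs a simplified power iteration on a toy weight matrix. It is an illustration under stated assumptions, not NetworkX's implementation: the name `pagerank_power_iteration`, the toy weights, and the fixed iteration count are all hypothetical, and the sketch assumes non-negative weights with no all-zero rows.

```python
def pagerank_power_iteration(weights, damping=0.85, iterations=100):
    """Simplified PageRank on a dense, non-negative weight matrix."""
    n = len(weights)
    # Normalise each row so that a node's outgoing weights sum to 1
    row_sums = [sum(row) for row in weights]
    transition = [[w / row_sums[i] for w in weights[i]] for i in range(n)]

    scores = [1.0 / n] * n  # Start from a uniform distribution
    for _ in range(iterations):
        scores = [
            (1.0 - damping) / n
            + damping * sum(scores[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
    return scores

# Toy symmetric similarity matrix: sentence 0 is strongly tied to both others
toy_weights = [
    [0.0, 0.9, 0.8],
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
]
toy_scores = pagerank_power_iteration(toy_weights)
ranking = sorted(range(len(toy_scores)), key=lambda i: toy_scores[i], reverse=True)
print(toy_scores)
print(ranking)
```

The sentence with the largest total similarity to the others ends up with the highest score, which is why TextRank tends to pick sentences that are central to the document. Cosine similarities can be negative, so a real similarity matrix may need clipping or rescaling before a sketch like this applies.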


#### Last Step - Evaluation

In [321]:
# NOTE: Evaluation will depend on the method used to implement extractive summarization
#       - ILP (Integer Linear Programming): We can use ROUGE-2 for evaluation
# Andres NOTE: This is the only section that I am unsure of. It would be cool to use ROUGE-2 to compare our TextRank algorithm to the bigram inspection


def csv_column_to_list(file_path, column_index):
    column_data = []
    with open(file_path, encoding="utf8") as file:
        csv_reader = csv.reader(file)
        for row in csv_reader:
            if len(row) > column_index:  # Ensure the row has the desired column
                column_data.append(row[column_index].replace("\n"," "))

    return column_data

csvFile = "./Dataset/CnnTestData.csv"

# Get the list of articles and human summaries that we are going to be evaluating
testDocs = csv_column_to_list(csvFile,1)
testDocs = testDocs[1:21]

humanSumm = csv_column_to_list(csvFile,2)
humanSumm = humanSumm[1:21]

In [322]:
# Get our model's summaries of the documents
modelSumms = []

for doc in testDocs:
    data, tokenized_sentences = preprocess_text(doc, stop_words)
    max_sentence_length = max([len(sentence) for sentence in data])
    model = Word2Vec(data, min_count=1, vector_size=1, epochs=5000)

    # Grab sentence embeddings by leveraging the word embeddings and sentence tokens
    sentence_embeddings = [[model.wv[word][0] for word in words] for words in data]

    # Pad the sentence embeddings with 0's to ensure all sentences have the same length
    sentence_embeddings = [np.pad(embedding, (0, max_sentence_length - len(embedding)), 'constant') for embedding in sentence_embeddings]

    # Calculate the similarity matrix
    # Instantiate a matrix of zeros with the same shape as the number of sentences
    similarity_matrix = np.zeros([len(data), len(data)])

    # Populate the similarity matrix with cosine similarity scores (same as 1 - cosine distance)
    for i, row in enumerate(sentence_embeddings):
        for j, col in enumerate(sentence_embeddings):
            similarity_matrix[i][j] = 1 - spatial.distance.cosine(row, col)

    graph = nx.from_numpy_array(similarity_matrix)
    scores = nx.pagerank(graph)
    # NOTE: Modify this variable to change the number of sentences in the summary
    num_sent_to_extract = 4

    extractive_summary = top_n_sentences(num_sent_to_extract, scores, tokenized_sentences)

    # Join the selected sentences with spaces to build the summary string
    s = " ".join(extractive_summary.keys())
    
    modelSumms.append(s)

print(modelSumms[0])


In [323]:
# Now that we have our model's summaries, we can compare them to the human-written ones using ROUGE
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

allScores = [[],[],[]]
for i in range(len(modelSumms)):
    score = scorer.score(target=humanSumm[i],prediction=modelSumms[i])
    r1fscore = score['rouge1'].fmeasure
    r2fscore = score['rouge2'].fmeasure
    rLfscore = score['rougeL'].fmeasure
    allScores[0].append(r1fscore)
    allScores[1].append(r2fscore)
    allScores[2].append(rLfscore)

# List of F-scores in the order ['rouge1', 'rouge2', 'rougeL']
print(allScores)
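
For intuition about what the ROUGE-N F-score measures, here is a simplified n-gram overlap sketch. It is an illustration only: the function names `ngrams` and `rouge_n` are hypothetical, the example sentences are made up, and the sketch uses plain whitespace tokenization with no stemming, so its numbers will generally differ from those of `rouge_score`.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(target, prediction, n=2):
    """Return (precision, recall, F-measure) of n-gram overlap."""
    target_ngrams = ngrams(target.lower().split(), n)
    prediction_ngrams = ngrams(prediction.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((target_ngrams & prediction_ngrams).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(prediction_ngrams.values())
    recall = overlap / sum(target_ngrams.values())
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

reference = "the tsunami caused severe damage"
candidate = "the tsunami caused flooding"
p, r, f = rouge_n(reference, candidate, n=2)
print(p, r, f)
```

Here two of the candidate's three bigrams appear among the reference's four bigrams, giving a precision of 2/3, a recall of 1/2, and an F-measure of 4/7.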



In [324]:
# Getting the average F-score of the three metrics

alg1rouge1 = sum(allScores[0]) / len(allScores[0])
alg1rouge2 = sum(allScores[1]) / len(allScores[1])
alg1rougeL = sum(allScores[2]) / len(allScores[2])

print(f"TextRank ROUGE-1 F-score = {alg1rouge1}")
print(f"TextRank ROUGE-2 F-score = {alg1rouge2}")
print(f"TextRank ROUGE-L F-score = {alg1rougeL}")

### Method 2 - Latent Semantic Indexing (LSI)

#### Step 1 - Data Collection

In [8]:
# Gather lengthy articles or a collection of documents that all relate to the same topic (e.g. documents covering an earthquake)
# LSI (Latent Semantic Indexing): Multi-document summarization

'''
    Input: List of file paths to text files
    Output: List of strings, one per text file
'''
def txt_files_to_string(filepaths) -> list[str]:
    document_list = []
    for filepath in filepaths:
        with open(filepath, 'r', encoding='utf8') as file:
            data = file.read()
            data = data.replace('\n', ' ') # Remove newline characters
            document_list.append(data)
    return document_list

# Data is located in text format, character escaped, inside the Documents folder
document_filepath_1 = 'Documents/Japanese_Earthquake-NationalGeographic.txt'
document_filepath_2 = 'Documents/Japanese_Earthquake-Britannica.txt'
documents = [document_filepath_1, document_filepath_2]
document_text_list = txt_files_to_string(documents)


#### Step 2 - Data Preprocessing

In [9]:
# LSI (Latent Semantic Indexing): Tokenize, remove stopwords, and stem the words
def preprocess_lsi_text(document_list) -> tuple[list, list]:
    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    # Regular expression to match any punctuation (compiled once, reused for every document)
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    processed_docs = []
    tokenized_documents = []

    for doc in document_list:
        # Split the document into sentences and keep the originals for the summary
        tokenized_sentences = sent_tokenize(doc, language='english')
        tokenized_documents.append(tokenized_sentences)
        # Lowercase
        sentences_to_lower = [sentence.lower() for sentence in tokenized_sentences]
        # Remove punctuation
        sentences_no_punctuation = [regex.sub('', sentence) for sentence in sentences_to_lower]
        # Remove stop words (split() also drops empty tokens left by repeated spaces)
        removed_stop_words = [[word for word in sentence.split() if word not in stop_words] for sentence in sentences_no_punctuation]
        # Stem every remaining word, sentence by sentence
        stemmed_sentences = [[stemmer.stem(word) for word in sentence] for sentence in removed_stop_words]

        processed_docs.append(stemmed_sentences)
    return processed_docs, tokenized_documents

processed_docs, tokenized_documents = (preprocess_lsi_text(document_text_list))


#### Step 3 - Feature Engineering

In [10]:
# LSI (Latent Semantic Indexing): Term Frequency
def create_dict_bow(document):
    dictionary = corpora.Dictionary(document)
    bow_corpus = [dictionary.doc2bow(doc) for doc in document]
    return dictionary, bow_corpus

dictionary, bow_corpus = create_dict_bow(processed_docs[0])
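
`dictionary.doc2bow` turns each tokenized sentence into sparse `(token_id, count)` pairs. The standard-library sketch below mimics that shape for intuition. `build_vocabulary` and `to_bow` are hypothetical helpers, the toy tokens are made up, and ids are assigned in first-seen order, which is an assumption and need not match the ids Gensim assigns.

```python
from collections import Counter

def build_vocabulary(sentences):
    """Map each distinct token to an integer id, in first-seen order."""
    vocabulary = {}
    for sentence in sentences:
        for token in sentence:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary

def to_bow(sentence, vocabulary):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(vocabulary[token] for token in sentence if token in vocabulary)
    return sorted(counts.items())

# Toy stemmed sentences
toy_sentences = [["tsunami", "caus", "tsunami"], ["earthquak", "caus", "damag"]]
toy_vocabulary = build_vocabulary(toy_sentences)
toy_bow = [to_bow(sentence, toy_vocabulary) for sentence in toy_sentences]
print(toy_vocabulary)
print(toy_bow)
```

Each sentence is reduced to the counts of the terms it contains, so word order is discarded at this point and only term frequency is passed on to the LSI model.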

#### Step 4 - Algorithm and Results

In [11]:
# LSI (Latent Semantic Indexing): Create LSI Model using Gensim
lsi_model = LsiModel(bow_corpus, num_topics=2, id2word=dictionary)
# Project each sentence into topic space: one list of (topic_id, weight) pairs per sentence
sentence_scores = lsi_model[bow_corpus]
print(lsi_model.print_topics(num_topics=2, num_words=4))

[(0, '0.401*"tsunami" + 0.368*"nuclear" + 0.224*"fukushima" + 0.224*"busi"'), (1, '-0.356*"tohoku" + -0.289*"earthquak" + -0.232*"tsunami" + 0.227*"nuclear"')]
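
Conceptually, LSI is a truncated singular value decomposition of the term-by-sentence count matrix. The NumPy sketch below illustrates that idea on a toy matrix. It is a simplified illustration rather than Gensim's algorithm, which computes the decomposition differently. The matrix values and term labels are made up, and the variable names are assumptions.

```python
import numpy as np

# Rows are terms, columns are sentences (toy counts)
term_sentence = np.array([
    [2.0, 0.0, 1.0],  # "tsunami"
    [1.0, 0.0, 1.0],  # "wave"
    [0.0, 2.0, 0.0],  # "nuclear"
    [0.0, 1.0, 0.0],  # "reactor"
])

num_topics = 2
U, singular_values, Vt = np.linalg.svd(term_sentence, full_matrices=False)

# Coordinates of each sentence in topic space, shape (num_topics, num_sentences).
# This equals U[:, :num_topics].T @ term_sentence.
sentence_topics = singular_values[:num_topics, None] * Vt[:num_topics]

# The signs of SVD factors are arbitrary, so compare magnitudes
strongest_topic = np.abs(sentence_topics).argmax(axis=0)
print(np.round(np.abs(sentence_topics), 3))
print(strongest_topic)
```

Sentences 0 and 2 share the "tsunami" vocabulary and load on the same topic, while sentence 1 loads on the "nuclear" topic. Taking absolute weights when ranking sentences, as `sort_scores` does, sidesteps the sign ambiguity.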


In [12]:
def takenext(elem):
    return elem[1]

# Group (sentence_index, |weight|) pairs by topic and sort each topic's list by descending weight
def sort_scores(sentence_scores, num_topics=2):
    sorted_scores = [[] for _ in range(num_topics)]
    for i, docv in enumerate(sentence_scores):
        for score in docv:
            sorted_scores[score[0]].append((i, abs(score[1])))
    sorted_scores = [sorted(topic_scores, key=takenext, reverse=True) for topic_scores in sorted_scores]
    return sorted_scores

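# Pick sentences round-robin across topics, skipping duplicates, until summary_len sentences are chosen
def select_top_sent_lsi(sorted_scores, summary_len, num_topics):
    top_sentences = []
    sentence_set = set()
    for i in range(summary_len):
        for j in range(num_topics):
            score_vectors = sorted_scores[j]
            if i >= len(score_vectors):
                continue  # This topic has no more candidate sentences
            sentence = score_vectors[i][0]
            if sentence not in sentence_set:
                top_sentences.append(score_vectors[i])
                sentence_set.add(sentence)
                if len(top_sentences) == summary_len:
                    return top_sentences
    # Fewer than summary_len unique sentences were available
    return top_sentences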

sorted_scores = sort_scores(sentence_scores)
top_scores = select_top_sent_lsi(sorted_scores, 2, 2)
print(top_scores)
top_scores.sort()
print(top_scores)

[(8, 3.7066956771574504), (1, 3.012181357850092)]
[(1, 3.012181357850092), (8, 3.7066956771574504)]


In [13]:
def create_sentence_list(top_sentences):
    top_sentence_list = []
    for i in top_sentences:
        top_sentence_list.append(i[0])
    return top_sentence_list

top_sentence_list = create_sentence_list(top_scores)

In [14]:
# The LSI model was built from processed_docs[0], so the indices in
# top_sentence_list refer to sentences of the first document only.
source_sentences = tokenized_documents[0]
print(source_sentences)

# Keep the selected sentences in their original document order
summary = " ".join(sentence for i, sentence in enumerate(source_sentences) if i in top_sentence_list)
doc = " ".join(source_sentences)
print()
#print(doc)
print()
print(summary)

['On March 11, 2011, Japan experienced the strongest earthquake in its recorded history.', 'The earthquake struck below the North Pacific, 130 kilometers (81 miles) east of Sendai, the largest city in the Tohoku region, a northern part of the island of Honshu.', 'The Tohoku earthquake caused a tsunami.', 'A tsunami—Japanese for “harbor wave”—is a series of powerful waves caused by the displacement of a large body of water.', 'Most tsunamis, like the one that formed off Tohoku, are triggered by underwater tectonic activity, such as earthquakes and volcanic eruptions.', 'The Tohoku tsunami produced waves up to 40 meters (132 feet) high, More than 450,000 people became homeless as a result of the tsunami.', 'More than 15,500 people died.', 'The tsunami also severely crippled the infrastructure of the country.', 'In addition to the thousands of destroyed homes, businesses, roads, and railways, the tsunami caused the meltdown of three nuclear reactors at the Fukushima Daiichi Nuclear Power 

#### Last Step - Evaluation

#### References
##### The following tutorials helped us implement the algorithms in the document:
##### 1. https://medium.com/data-science-in-your-pocket/text-summarization-using-textrank-in-nlp-4bce52c5b390
##### 2. https://towardsdatascience.com/document-summarization-using-latent-semantic-indexing-b747ef2d2af6 