<H3>PRI 2023/24: first project delivery</H3>

**GROUP 11**
- Francisco Martins, 99068
- Tunahan Güneş, 108108
- Sebastian Weidinger, 111612

<H3>Part I: demo of facilities</H3>

In [None]:
import os

In [None]:
def read_files(path):
    texts = []
    file_paths = []
    for folder in os.listdir(path):
        category_path = os.path.join(path, folder)
        texts.append([])
        for file in os.listdir(category_path):
            file_path = os.path.join(category_path, file)
            file_paths.append(file_path)
            with open(file_path, "r", errors="ignore") as f:
                text = f.read()
                texts[-1].append(text)
                
    print("Number of Categories:",len(os.listdir(path)))
    for i in range(len(os.listdir(path))):
        print("Number of Articles in", "'"+os.listdir(path)[i]+"'", "Category:",len(texts[i]))
    return file_paths, texts

In [None]:
dataset_path = os.path.join("BBC News Summary", "BBC News Summary", "News Articles")
print("Dataset path:", dataset_path)
file_paths, categorized_articles = read_files(dataset_path)

In [None]:
# Exemplary text. The structure of the read data is: categorized_articles[category_no][document_no].
print(categorized_articles[0][0])
print(file_paths[508:512])

A) **Indexing** (preprocessing and indexing options)

In [None]:
#code, statistics and/or charts here

imports

In [None]:
import time 
from typing import Union
import nltk
import numpy as np
import math
import torch
import sklearn
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
from collections import Counter, defaultdict
from tabulate import tabulate
from transformers import BertTokenizer, BertModel

In [None]:
# flatten list to get uncategorized collection 
def flatten(lists) -> list: 
    return [element for sublist in lists for element in sublist]

articles = flatten(categorized_articles)
N = len(articles)
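# normalize the keys so that lookups via map_path_to_articleID (which also normalizes) always match
dict_path_to_articleID = {os.path.normpath(path): i for i, path in enumerate(file_paths)}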

def map_path_to_articleID(path):
    path = os.path.normpath(path)
    return dict_path_to_articleID.get(path)

In [None]:
N

# Inverted Index Structure 
 
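Each term points to a dictionary that maps document identifiers to the term frequency in that document.

t1 -> {doc1: TF, doc5: TF, ...}\
t2 -> {doc7: TF, doc8: TF, ...}\
...

Extended with the document frequency and the per-sentence term frequencies, an entry looks like this:

t2 -> [DF, {doc7: [TF_(t2, doc7), {s1: TF, s4: TF, ...}], doc8: [TF_(t2, doc8), {s2: TF, s4: TF, ...}], ...}]

The implementation uses a class structure (`InvertedIndex`, `InvertedIndexEntry`, `TermFrequencies`) for this layout.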

TODO: 
* Optimize structure?
    * Is there a more efficient way? 
    * Add maybe pointers to sentences and their term frequency? -> Faster?
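
As a rough illustration of the nested layout described above, the same information can be held in plain dictionaries. This is only a sketch on a hypothetical toy input (`toy_docs`, `postings` and `doc_freq` are illustrative names, not part of the notebook), and it skips all preprocessing.

```python
from collections import Counter, defaultdict

# Hypothetical toy input: each document is a list of already tokenized sentences.
toy_docs = [
    [["white", "rabbit"], ["rabbit", "ball", "rabbit"]],
    [["little", "rabbit"]],
]

# term -> doc_id -> {"tf": document-level TF, "sent_tf": [(sentence_no, TF), ...]}
postings = defaultdict(dict)
for doc_id, sentences in enumerate(toy_docs):
    for sent_no, tokens in enumerate(sentences):
        for term, tf in Counter(tokens).items():
            entry = postings[term].setdefault(doc_id, {"tf": 0, "sent_tf": []})
            entry["tf"] += tf
            entry["sent_tf"].append((sent_no, tf))

# The document frequency is the number of documents in a term's posting dictionary.
doc_freq = {term: len(docs) for term, docs in postings.items()}

print(doc_freq["rabbit"])     # 2
print(postings["rabbit"][0])  # {'tf': 3, 'sent_tf': [(0, 1), (1, 2)]}
```

Keeping the per-sentence counts next to the document-level count means sentence scores can be computed from the index alone, without tokenizing the document again.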

In [None]:
max_width = 20

In [None]:
class TermFrequencies: 
    def __init__(self) -> None:
        self.tf_d_t = 0
        self.sent_tf = list()

    def add_sentence(self, sent_number, term_frequency):
        self.sent_tf.append((sent_number, term_frequency))
    
    def __repr__(self):
        padding = 5 - len(str(self.tf_d_t))
        return f'TF_d_t: {self.tf_d_t}{" " * padding}TF_per_sentence: {self.sent_tf}'

In [None]:
class InvertedIndexEntry:
    def __init__(self) -> None:
        self.df_term = 0
        self.term_dict = defaultdict(TermFrequencies)
    
    def get_document(self, document):
        return self.term_dict.get(document, None)

    def get_or_default_document(self, document):
        return self.term_dict[document]

    def update_document(self, document, new_value):
        self.term_dict[document] = new_value
    
    def __repr__(self):
        out = f'Document Frequency: {self.df_term}\n {" " * (max_width+2)} Term frequencies:\n'
        for doc_number, tfs in self.term_dict.items():
            padding = 5 - len(str(doc_number))
            out += f'{" " * (max_width + 3)} Doc {doc_number}{" " * padding}→ {tfs}\n'
        return out
    
    def calculate_df(self):
        self.df_term = len(self.term_dict)

In [None]:
class InvertedIndex:
    def __init__(self, collection_size) -> None:
        self.inverted_index = defaultdict(InvertedIndexEntry)
        self.sentence_lengths = list()
        self.indexing_time = 0
        self.N = collection_size
    
    def __repr__(self):
        out = f'Time to index: {self.indexing_time}\nInverted Index:\n'
        for term, entry in self.inverted_index.items():
            padding = max_width - len(term)
            out += f'{term} {" " * padding} → {entry}\n'
        return out

    def get_or_default(self, term, document):
        return self.inverted_index[term].get_or_default_document(document)
    
    def update(self, term, document, new_value):
        self.inverted_index[term].update_document(document, new_value)
    
    def set_indexing_time(self, indexing_time):
        self.indexing_time = indexing_time
    
    def calculate_dfs(self):
        for entry in self.inverted_index.values():
            entry.calculate_df()  
    
    def get_sentence_lengths(self, document):
        return self.sentence_lengths[document]

    def get_document_info(self, document):          
        info = {'Vocabulary': [], 'DF_t': [], 'TF_d_t': [], 'TF/sentence': []}
        for term, entry in self.inverted_index.items():
            doc_tfs = entry.get_document(document)
            if doc_tfs is None:
                continue
            info['Vocabulary'].append(term)
            info['DF_t'].append(entry.df_term)
            info['TF_d_t'].append(doc_tfs.tf_d_t)
            info['TF/sentence'].append(doc_tfs.sent_tf)
        return info
    
    def doc_to_string(self, document: int):
        out = f'Document id={document} → vocabulary and term frequencies:\n'
        info = self.get_document_info(document)
        table = zip(*info.values())
        headers = info.keys()
        return out + tabulate(table, headers, tablefmt="pretty")


In [None]:
def preprocess(sentence: list, wnl: WordNetLemmatizer, stop_words: set):
    sent_out = list()
    for term in sentence:
        lem_term = wnl.lemmatize(term)
        if lem_term not in stop_words:      
            sent_out.append(lem_term)
    return sent_out

In [None]:
'''
indexing(D,args)
    @input document collection D and optional arguments on text preprocessing

    @behavior preprocesses the collection and, using existing libraries, 
    builds an inverted index with the relevant statistics for the subsequent summarization functions
    
    @output pair with the inverted index I and indexing time
'''
def indexing(articles, **args) -> InvertedIndex:
    start_time = time.time()
    inverted_index = InvertedIndex(len(articles))

    # tokenizer splits words and keeps hyphens, e.g. state-of-the-art
    # (inside a character class "|" is a literal pipe, so it is not needed here)
    tokenizer = RegexpTokenizer(r'[\w-]+')

    wnl = WordNetLemmatizer()
    # load the stop word list once instead of once per article
    stop_words = set(stopwords.words('english'))

    # loop through collection 
    for article_id, article in enumerate(articles): 
        # split into sentences
        sents = nltk.sent_tokenize(article)
        # remove title (not needed for summarization task)
        sents = sents[1:]
        # save words per sentence in a list 
        tokenized_sentences = [tokenizer.tokenize(sent.lower()) for sent in sents]
        # calculate length of the sentences in the article
        sent_lengths = [len(sentence_terms) for sentence_terms in tokenized_sentences]
        inverted_index.sentence_lengths.append(sent_lengths)
        # preprocess: lemmatize + stop word removal
        tokenized_sentences = [preprocess(sent, wnl, stop_words) for sent in tokenized_sentences]
        # count the term frequencies per sentence
        term_counter_per_sent = [Counter(sentence_terms) for sentence_terms in tokenized_sentences]
        for sent_number, term_counter in enumerate(term_counter_per_sent):
            for term in term_counter: 
                tf = term_counter[term]
                term_document_tfs = inverted_index.get_or_default(term, article_id)
                term_document_tfs.tf_d_t += tf 
                term_document_tfs.add_sentence(sent_number, tf)
                inverted_index.update(term, article_id, term_document_tfs)
    inverted_index.calculate_dfs()
    end_time = time.time()
    indexing_time = end_time - start_time
    inverted_index.set_indexing_time(indexing_time)
    return inverted_index


In [None]:
WordNetLemmatizer().lemmatize("played")

In [None]:
terms = [str(i) for i in range(200000)]
sws = set(stopwords.words('english'))
[term in sws for term in terms]

In [None]:
s0 = 'Title. The little white little rabbit. The person played with the ball.'
s1 = 'Title. The white rabbit\'s ball. Rabbit rabbit ball rabbit.'
s2 = 'Title.  White, the little white rabbit. Little, little.'
test = [s0, s1, s2]
I_test = indexing(test)

In [None]:
print(I_test)

In [None]:
print(I_test.doc_to_string(2))

In [None]:
I = indexing(articles)

In [None]:
print(I.sentence_lengths[0:2])

In [None]:
document_path = os.path.join("BBC News Summary", "BBC News Summary", "News Articles", "business", "509.txt")

print(I.doc_to_string(map_path_to_articleID(document_path)))

# Summarization 

TF: 
* Document: Term frequencies are assessed on document level.
* Sentence: Term frequencies are assessed on sentence level.

IDF: The inverse document frequency is assessed on collection level.\
\
Additional parameter "N" and "article_id". Is this allowed?

TODO: 
* Evaluate choice and give reason: 
    * IDF on document level?
    * TF on document level for sentences? 
* "order" parameter o
* BM25
* BERT embedding
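
To make the TF and IDF choices above concrete, here is a small worked example of the two term weights. It is a sketch only: the function names `tf_idf_weight` and `bm25_weight` and all numbers are hypothetical, and the formulas are assumed to be the log-scaled TF-IDF and the BM25 variant with a base-10 logarithm.

```python
import math

def tf_idf_weight(n_docs: int, df: int, tf: int) -> float:
    # log-scaled term frequency times inverse document frequency
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

def bm25_weight(n_docs: int, df: int, tf: int, length: int, avg_length: float,
                k: float, b: float) -> float:
    idf = math.log10(n_docs / df)
    # length normalization: sentences longer than average are penalized
    norm = 1 - b + b * (length / avg_length)
    return idf * (tf * (k + 1)) / (tf + k * norm)

# A term occurring 10 times in a document and in 10 of 1000 documents.
print(tf_idf_weight(1000, 10, 10))  # 4.0

# The same term is worth less when it is common in the collection.
rare = tf_idf_weight(1000, 10, 2)
common = tf_idf_weight(1000, 500, 2)
print(rare > common)  # True

# BM25 for one occurrence in a sentence of average length versus a long sentence.
average = bm25_weight(1000, 10, 1, length=20, avg_length=20, k=0.2, b=0.75)
long_sentence = bm25_weight(1000, 10, 1, length=40, avg_length=20, k=0.2, b=0.75)
print(average > long_sentence)  # True
```

The comparison at the end shows why BM25 needs the sentence lengths: the same term frequency contributes less in a sentence that is longer than average.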

In [None]:
def tf_idf_term(N, df_t, tf_t_d):
    return (1 + math.log10(tf_t_d)) * math.log10(N/df_t)

In [None]:
def BM25_term(df_t, tf_t_d, N, s_len_avg, s_len, k, b): 
    idf_t = math.log10(N/df_t)
    B = 1 - b + b * (s_len/s_len_avg)
    return idf_t * (tf_t_d * (k + 1))/(tf_t_d + k * B)

In [None]:
def sort_by_value(d: dict, max_sent: int, reverse=False) -> dict: 
    return dict(sorted(d.items(), key=lambda item: item[1], reverse=reverse)[:max_sent])

In [None]:
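def get_embedding(sentence: str, model, tokenizer, max_length=512) -> torch.Tensor: 
    encoded_input = tokenizer(sentence, return_tensors='pt', truncation=True, max_length=max_length)
    # move the inputs to the device of the model (summarization may have moved it to the GPU)
    device = next(model.parameters()).device
    encoded_input = {key: value.to(device) for key, value in encoded_input.items()}
    # inference only, so no gradients are needed
    with torch.no_grad():
        output = model(**encoded_input)
    embedding = output["pooler_output"].squeeze()
    # mean pooled embedding might be better
    # mean_pooled_embedding = output["last_hidden_state"].mean(dim=1)
    return embedding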

In [None]:
'''
summarization(d,p,l,o,I,args)
    @input document d (the index in I/D), maximum number of sentences (p) and/or characters (l), order
    of presentation o (appearance in text vs relevance), inverted index I or the
    collection D, and optional arguments on IR models

    @behavior preprocesses d, assesses the relevance of each sentence in d against I
    according to args, and presents them in accordance with p, l and o
    
    @output summary s of document d, i.e. ordered pairs (sentence position in d, score)
'''
def summarization(d: int, p: int, l: int, o: str, I_or_D: Union[InvertedIndex, list], **args) -> dict:

    if args['model'] != 'BERT':

        ## if we receive the collection instead of the inverted index we must compute it first
        if type(I_or_D) == list:
            I = indexing(I_or_D)         
        else: 
            I = I_or_D
        
        doc_info = I.get_document_info(d)
        sentence_lengths = I.get_sentence_lengths(d)
        term_doc_info = zip(*doc_info.values())    

    scores = defaultdict(int)
    sq_normalization_term = defaultdict(int)

    if args['model'] == 'TF-IDF':
        for term, df_t, tf_t_d, tf_per_sentence in term_doc_info:
            rel_t_d = tf_idf_term(I.N, df_t, tf_t_d)
            for sent_number, tf_s_t in tf_per_sentence:
                scores[sent_number] += rel_t_d * (1 + math.log10(tf_s_t))
                # accumulate the squared sentence weights over all terms of the sentence
                sq_normalization_term[sent_number] += (1 + math.log10(tf_s_t))**2
        # normalization
        for sent_number, score in scores.items():
            scores[sent_number] = score / math.sqrt(sq_normalization_term[sent_number])
    
    elif args['model'] == 'BM25':
        k = 0.2
        b = 0.75 
        sentence_lengths = I.sentence_lengths[d]
        avg_sentence_length = sum(sentence_lengths)/len(sentence_lengths)
        for term, df_t, tf_t_d, tf_per_sentence in term_doc_info: 
            for sent_number, tf_s_t in tf_per_sentence: 
                scores[sent_number] += BM25_term(df_t, tf_s_t, I.N, avg_sentence_length, sentence_lengths[sent_number], k, b)
    
    elif args['model'] == 'BERT':
        document = I_or_D[d]
        tokenizer = args['bert_tokenizer']
        bert_model = args['bert_model']
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        bert_model.to(device)
        scores = defaultdict(float)
        # sentences 
        sentences = nltk.sent_tokenize(document)
        sentences = sentences[1:]
        num_sentences = len(sentences)
        sent_embeddings = list()
        # every sentences on its own, no padding needed, faster on cpu
        # for gpu batches are better 
        for sent in sentences: 
            sent_embedding = get_embedding(sent, bert_model, tokenizer)
            sent_embeddings.append(sent_embedding)
        # document
        doc_embedding = get_embedding(document, bert_model, tokenizer, max_length=512)
        for sent_id in range(0, num_sentences): 
            sent_vec = sent_embeddings[sent_id]
            score = torch.nn.functional.cosine_similarity(doc_embedding, sent_vec, dim=0)
            scores[sent_id] = score.item()
            
    else:
        raise ValueError("Currently we only support the following models:\n→ TF-IDF\n→ BM25\n→ BERT")
    
    sorted_scores = sort_by_value(scores, max_sent=p, reverse=True)
    
    return sorted_scores
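
The docstring above mentions a character limit l and an order of presentation o in addition to the sentence limit p. One possible way to apply all three to a score dictionary is sketched here. The helper name `select_sentences`, the order label `"app"` (appearance in text) and the choice to stop at the first sentence that exceeds l are assumptions, not part of the notebook.

```python
from typing import Optional

def select_sentences(scores: dict, sentences: list, p: Optional[int] = None,
                     l: Optional[int] = None, o: str = "rel") -> list:
    """Pick the best-scoring sentences under a sentence limit p and/or a character limit l."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if p is not None:
        ranked = ranked[:p]
    selected, used_chars = [], 0
    for sent_id, score in ranked:
        length = len(sentences[sent_id])
        # assumption: stop as soon as the next best sentence no longer fits
        if l is not None and used_chars + length > l:
            break
        selected.append((sent_id, score))
        used_chars += length
    if o == "app":
        # present the selected sentences in their order of appearance in the text
        selected.sort(key=lambda pair: pair[0])
    return selected

toy_sentences = ["aaaa", "bbbbbb", "cc"]
toy_scores = {0: 0.5, 1: 0.2, 2: 0.9}
print(select_sentences(toy_scores, toy_sentences, p=2, o="rel"))  # [(2, 0.9), (0, 0.5)]
print(select_sentences(toy_scores, toy_sentences, p=2, o="app"))  # [(0, 0.5), (2, 0.9)]
print(select_sentences(toy_scores, toy_sentences, l=3))           # [(2, 0.9)]
```

Returning a list of (position, score) pairs keeps the order explicit, which matches the "ordered pairs" wording of the docstring.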

In [None]:
article_id = map_path_to_articleID(document_path)
print("ORIGINAL DOCUMENT")
print(articles[article_id])
scores = summarization(d=article_id, p=5, l=1000, o="rel", I_or_D=I, model='TF-IDF')

print("SUMMARY")
sentences = nltk.sent_tokenize(articles[article_id])
sentences = sentences[1:]
for sent_id, score in scores.items(): 
    print(f"{score:.2f}: {sentences[sent_id]}")

In [None]:
article_id = map_path_to_articleID(document_path)
print("ORIGINAL DOCUMENT")
print(articles[article_id])
scores = summarization(d=article_id, p=5, l=1000, o="rel", I_or_D=I, model='BM25')

print("SUMMARY")
sentences = nltk.sent_tokenize(articles[article_id])
sentences = sentences[1:]
for sent_id, score in scores.items(): 
    print(f"{score:.2f}: {sentences[sent_id]}")

In [None]:
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
bert_model = BertModel.from_pretrained("bert-base-uncased")

In [None]:
article_id = map_path_to_articleID(document_path)
print("ORIGINAL DOCUMENT")
print(articles[article_id])
scores = summarization(d=article_id, p=5, l=1000, o="rel", I_or_D=articles, model='BERT', bert_model=bert_model, bert_tokenizer=bert_tokenizer)

print("SUMMARY")
sentences = nltk.sent_tokenize(articles[article_id])
sentences = sentences[1:]
for sent_id, score in scores.items(): 
    print(f"{score:.2f}: {sentences[sent_id]}")

# Keyword Extraction

Calculates the keywords based on the tf-idf of the document.\
\
Additional parameter "N" and "article_id". Is this allowed?

Parameter for including only noun phrases. 

No need for BERT (see assignment sheet, p. 4, IR Models).

Should be primarily based on TF-IDF.


TODO:
* Nouns: just unigrams or also bigrams?
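
A noun-only option could be sketched as a filter in front of the ranking. The example below assumes the tokens have already been tagged with Penn Treebank tags (noun tags start with `NN`, as produced for instance by `nltk.pos_tag`). The helper names and the toy scores are hypothetical, and only unigrams are considered.

```python
def keep_nouns(tagged_tokens: list) -> set:
    # Penn Treebank noun tags: NN, NNS, NNP, NNPS
    return {token for token, tag in tagged_tokens if tag.startswith("NN")}

def rank_keywords(tfidf_scores: dict, allowed_terms: set, p: int) -> list:
    # keep only the allowed candidates, then take the p highest scores
    candidates = {term: s for term, s in tfidf_scores.items() if term in allowed_terms}
    return sorted(candidates.items(), key=lambda item: item[1], reverse=True)[:p]

# Hypothetical tagged tokens and TF-IDF scores for one document.
tagged = [("rabbit", "NN"), ("played", "VBD"), ("ball", "NN"), ("little", "JJ")]
toy_tfidf = {"rabbit": 1.4, "played": 2.1, "ball": 0.9, "little": 1.7}

print(rank_keywords(toy_tfidf, keep_nouns(tagged), p=2))  # [('rabbit', 1.4), ('ball', 0.9)]
```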


In [None]:
from nltk.classify import Senna
from nltk.tag import SennaChunkTagger
from nltk.corpus import stopwords

In [None]:
'''
keyword_extraction(d,p,I,args)
    @input document d, maximum number of keywords p, inverted index I, and
    optional arguments on IR model choices

    @behavior extracts the top informative p keywords in d against I according to args
    
    @output ordered set of p keywords
'''

In [None]:
def keyword_extraction(d: int, p: int ,I: InvertedIndex, **args) -> dict:
     
    doc_info = I.get_document_info(d)
    term_doc_info = zip(*doc_info.values())    

    scores = defaultdict(float)

    for term, df_t, tf_t_d, tf_per_sentence in term_doc_info:
        rel_t_d = tf_idf_term(I.N, df_t, tf_t_d)
        scores[term] = rel_t_d
    scores = sort_by_value(scores, p, reverse=True)
    return scores         

In [None]:
article_id = map_path_to_articleID(document_path)
scores = keyword_extraction(article_id, 10, I)
print(scores)

print(I.doc_to_string(article_id))

# Evaluation

TODO:
* Implement evaluation
* Evaluation:
    * Statistics 
    * F-measure
    * Recall-precision-curve
    * MAP
    * Efficiency
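
The metrics listed above could be computed per document along the following lines. This is a sketch under assumptions: a summary is a ranked list of sentence positions, the reference extract is a set of sentence positions, and average precision is normalized by the size of the reference set. The function names are hypothetical.

```python
def precision_recall_f(selected: list, reference: set, beta: float = 1.0) -> tuple:
    hits = len(set(selected) & reference)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f_measure

def average_precision(ranked: list, reference: set) -> float:
    hits, total = 0, 0.0
    for rank, sent_id in enumerate(ranked, start=1):
        if sent_id in reference:
            hits += 1
            # precision at the rank of each relevant sentence
            total += hits / rank
    return total / len(reference) if reference else 0.0

# Hypothetical summary (ranked sentence positions) and reference extract.
ranked_summary = [3, 0, 5, 1]
reference_extract = {0, 1, 2}

precision, recall, f1 = precision_recall_f(ranked_summary, reference_extract)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
print(f"AP={average_precision(ranked_summary, reference_extract):.2f}")
```

MAP is then the mean of `average_precision` over all evaluated documents, and a recall-precision curve can be read off by evaluating `precision_recall_f` on growing prefixes of the ranked summary.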

In [None]:
'''
evaluation(Sset,Rset,args)
    @input the set of summaries Sset produced from selected documents Dset ⊆ D
    (e.g. a single document, a category of documents, the whole collection),
    the corresponding reference extracts Rset, and optional arguments
    (evaluation, preprocessing, model options)

    @behavior assesses the produced summaries against the reference ones using the
    target evaluation criteria

    @output evaluation statistics, including F-measuring at predefined p-or-l summary
    limits, recall-and-precision curves, MAP, and efficiency
'''

In [None]:
def evaluation(S: list, R: list, args=None) -> list:
    # do evaluation... 
    pass

Use only a subset for the BERT comparison, since it takes too long otherwise.

TODO: 
* evaluation + first question -> Sebastian   
* question 2, 3, 4 -> Francisco
* question 5 and 6 -> Tuna

B) **Summarization**

*B.1 Summarization solution: results for a given document*

In [None]:
#code, statistics and/or charts here
article_id = 0
print(articles[article_id])
summarization(d=article_id, p=5, l=1000, o="rel", I_or_D=I, model='TF-IDF')

*B.2 IR models (TF-IDF, BM25 and BERT)*

In [None]:
#code, statistics and/or charts here

*B.3 Reciprocal rank fusion*
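
As a starting point for this section, reciprocal rank fusion can be sketched as follows: every ranking contributes 1 / (k + rank) to a sentence's fused score. The function name, the toy rankings and the constant k = 60 (a commonly used default) are assumptions, not results from this notebook.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several rankings of sentence positions into one ranked list of (position, score)."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, sent_id in enumerate(ranking, start=1):
            fused[sent_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Hypothetical rankings of sentence positions from three models.
toy_rankings = [[2, 0, 1], [2, 1, 3], [0, 2, 1]]
fused_ranking = reciprocal_rank_fusion(toy_rankings)
print([sent_id for sent_id, _ in fused_ranking])  # [2, 1, 0, 3]
```

Because only ranks are used, the scores of TF-IDF, BM25 and BERT do not need to be on a comparable scale.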

In [None]:
#code, statistics and/or charts here

*B.4 Maximal Marginal Relevance*
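
A greedy form of maximal marginal relevance can be sketched as below: at each step the sentence with the best trade-off between relevance and similarity to the already selected sentences is added. The function name, the trade-off parameter `lam` and the toy relevance and similarity values are assumptions for illustration.

```python
def mmr_select(relevance: list, similarity: list, p: int, lam: float = 0.5) -> list:
    """Greedily select p sentence positions, trading relevance against redundancy."""
    candidates = set(range(len(relevance)))
    selected = []
    while candidates and len(selected) < p:
        def mmr_score(i: int) -> float:
            # redundancy is the highest similarity to any sentence chosen so far
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(sorted(candidates), key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical example: sentences 0 and 1 are near duplicates, sentence 2 is different.
toy_relevance = [0.9, 0.85, 0.3]
toy_similarity = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
print(mmr_select(toy_relevance, toy_similarity, p=2, lam=0.5))  # [0, 2]
print(mmr_select(toy_relevance, toy_similarity, p=2, lam=1.0))  # [0, 1]
```

With `lam=1.0` the selection falls back to pure relevance ranking, which makes the effect of the redundancy penalty easy to compare.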

In [None]:
#code, statistics and/or charts here

C) **Keyword extraction**

In [None]:
#code, statistics and/or charts here
article_id = 0
print(articles[article_id])
# use_only_nouns is passed through **args and is not used by keyword_extraction yet
keyword_extraction(article_id, 10, I, use_only_nouns=True)

D) **Evaluation**

In [None]:
#code, statistics and/or charts here

<H3>Part II: questions materials (optional)</H3>

**(1)** Corpus *D* and summaries *S* description.

In [None]:
#code, statistics and/or charts here

**(2)** Summarization performance for the overall and category-conditional corpora.

In [None]:
#code, statistics and/or charts here

**...** (additional questions with empirical results)

<H3>END</H3>