<H3>PRI 2023/24: first project delivery</H3>

**GROUP 11**
- Francisco Martins, 99068
- Tunahan Güneş, 108108
- Sebastian Weidinger, 111612

<H3>Part I: demo of facilities</H3>

In [47]:
import os

In [48]:
def read_files(path):
    texts = []
    for folder in os.listdir(path):
        category_path = os.path.join(path, folder)
        texts.append([])
        for file in os.listdir(category_path):
            file_path = os.path.join(category_path, file)
            with open(file_path, "r", errors="ignore") as f:
                text = f.read()
                texts[-1].append(text)
                
    print("Number of Categories:",len(os.listdir(path)))
    for i in range(len(os.listdir(path))):
        print("Number of Articles in", "'"+os.listdir(path)[i]+"'", "Category:",len(texts[i]))
    return texts

In [49]:
dataset_path = os.path.join("BBC News Summary", "BBC News Summary", "News Articles")
print("Dataset path:", dataset_path)
categorized_articles = read_files(dataset_path)

Dataset path: BBC News Summary/BBC News Summary/News Articles
Number of Categories: 5
Number of Articles in 'tech' Category: 401
Number of Articles in 'entertainment' Category: 386
Number of Articles in 'sport' Category: 511
Number of Articles in 'business' Category: 510
Number of Articles in 'politics' Category: 417


In [50]:
# Exemplary text. The structure of the loaded data is: categorized_articles[category_no][document_no].
print(categorized_articles[0][0])

Bond game fails to shake or stir

For gaming fans, the word GoldenEye evokes excited memories not only of the James Bond revival flick of 1995, but also the classic shoot-em-up that accompanied it and left N64 owners glued to their consoles for many an hour.

Adopting that hallowed title somewhat backfires on this new game, for it fails to deliver on the promise of its name and struggles to generate the original's massive sense of fun. This however is not a sequel, nor does it relate to the GoldenEye film. You are the eponymous renegade spy, an agent who deserts to the Bond world's extensive ranks of criminal masterminds, after being deemed too brutal for MI6. Your new commander-in-chief is the portly Auric Goldfinger, last seen in 1964, but happily running around bent on world domination. With a determination to justify its name which is even less convincing than that of Tina Turner's similarly-titled theme song, the game literally gives the player a golden eye following an injury, wh

A) **Indexing** (preprocessing and indexing options)

In [51]:
#code, statistics and/or charts here

imports

In [52]:
import time 
from typing import Union
import nltk
import numpy as np
import math
import sklearn.metrics  # explicit submodule import, needed for sklearn.metrics.pairwise_distances
from nltk.tokenize import RegexpTokenizer
from collections import Counter, defaultdict

In [53]:
# flatten list to get uncategorized collection 
def flatten(lists) -> list: 
    return [element for sublist in lists for element in sublist]

articles = flatten(categorized_articles)

# Inverted Index Structure 
 
Each term points to a dictionary that maps a document identifier to the term frequency in that document.

t1 -> {doc1: TF, doc5: TF, ...}\
t2 -> {doc7: TF, doc8: TF, ...}\
...

Possible extension that also stores the document frequency (DF) and sentence-level term frequencies, implemented with a class structure:

t2 -> [DF, {doc7: [TF_(t2, doc7), {s1: TF, s4: TF, ...}], doc8: [TF_(t2, doc8), {s2: TF, s4: TF, ...}], ...}]

TODO: 
* Optimize structure?
    * Is there a more efficient way? 
    * Add maybe pointers to sentences and their term frequency? -> Faster?
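The extended layout described above can be sketched on toy data. This is only an illustration using plain lists and dicts: the function name `build_nested_index`, the toy documents and the naive split on `.` are assumptions made for the example, not part of the project code.

```python
import re
from collections import Counter

def build_nested_index(docs):
    """Toy index: term -> [DF, {doc_id: [TF_doc, {sent_id: TF_sent}]}]."""
    index = {}
    for doc_id, doc in enumerate(docs):
        # naive sentence split on '.', good enough for the toy documents below
        sentences = [s for s in doc.split(".") if s.strip()]
        for sent_id, sentence in enumerate(sentences):
            counts = Counter(re.findall(r"[\w-]+", sentence.lower()))
            for term, tf in counts.items():
                entry = index.setdefault(term, [0, {}])
                if doc_id not in entry[1]:
                    entry[0] += 1                   # term seen in a new document
                    entry[1][doc_id] = [0, {}]
                entry[1][doc_id][0] += tf           # document-level TF
                entry[1][doc_id][1][sent_id] = tf   # sentence-level TF
    return index

toy_docs = ["Bond is back. Bond fights Goldfinger.", "Goldfinger wins."]
toy_index = build_nested_index(toy_docs)
print(toy_index["bond"])
print(toy_index["goldfinger"])
```

With this layout, DF and both TF levels for a term are available from a single lookup, at the cost of a larger index.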

In [61]:
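class TermFrequencies: 
    def __init__(self, tf_d_t: int = 0, sent_tf: dict = None) -> None:
        # term frequency in the document and per-sentence term frequencies
        self.tf_d_t = tf_d_t
        self.sent_tf = sent_tf if sent_tf is not None else dict()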

    def get_sentence(self, s):
        return self.sent_tf.get(s, None)

In [62]:
class InvertedIndexInfo:
    def __init__(self) -> None:
        self.df_term = 0
        self.term_dict = dict()

    def get_document(self, d):
        return self.term_dict.get(d, None)

In [63]:
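def indexing2(articles, args=None) -> tuple: 
    # work-in-progress variant that also stores DF and sentence-level TFs per term
    start_time = time.time()
    inverted_index = dict()
    # tokenizer splits words and keeps hyphens, e.g. state-of-the-art
    tokenizer = RegexpTokenizer(r'[\w|-]+')

    for article_id, article in enumerate(articles):
        # split into sentences
        sents = nltk.sent_tokenize(article)

        terms_per_sent = [tokenizer.tokenize(sent.lower()) for sent in sents]

        for sent_id, terms in enumerate(terms_per_sent): 
            tf_s = Counter(terms)
            for term, tf in tf_s.items():
                if term not in inverted_index: 
                    inverted_index[term] = InvertedIndexInfo()
                index_info_of_term = inverted_index[term]
                if article_id not in index_info_of_term.term_dict:
                    # first occurrence of the term in this document
                    index_info_of_term.term_dict[article_id] = TermFrequencies(0, dict())
                    index_info_of_term.df_term += 1
                doc_tfs = index_info_of_term.term_dict[article_id]
                doc_tfs.tf_d_t += tf
                doc_tfs.sent_tf[sent_id] = tf

    return inverted_index, time.time() - start_time

inverted_index2, indexing_time2 = indexing2(articles)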

In [8]:
'''
indexing(D,args)
    @input document collection D and optional arguments on text preprocessing

    @behavior preprocesses the collection and, using existing libraries, 
    builds an inverted index with the relevant statistics for the subsequent summarization functions
    
    @output pair with the inverted index I and indexing time
'''

def indexing(articles, args=None) -> tuple:
    inverted_index = defaultdict(dict)
    start_time = time.time()
    # tokenizer splits words and keeps hyphens, e.g. state-of-the-art
    tokenizer = RegexpTokenizer(r'[\w|-]+')

    # loop through collection 
    for article_id, article in enumerate(articles): 
        # split into sentences
        sents = nltk.sent_tokenize(article)
        # remove title (not needed for summarization task)
        sents = sents[1:]
        # save words per sent in list 
        words_per_sent = [tokenizer.tokenize(sent.lower()) for sent in sents]
        # add words to inverted index 
        word_counter_doc = Counter(flatten(words_per_sent))
        for word in word_counter_doc: 
            tf = word_counter_doc[word]
            inverted_index[word][article_id] = tf
    end_time = time.time()
    indexing_time = end_time - start_time
    return inverted_index, indexing_time

inverted_index, indexing_time = indexing(articles)
print(f"indexing time: {indexing_time} seconds") 


indexing time: 1.0544142723083496 seconds


# Summarization 

TF: 
* Document: term frequencies are assessed at document level.
* Sentence: term frequencies are assessed at sentence level.

IDF: inverse document frequency is assessed at collection level.\
\
Additional parameters "N" and "article_id". Is this allowed?

TODO: 
* Evaluate choice and give reason: 
    * IDF on document level?
    * TF on document level for sentences? 
* "order" parameter o
* BM25
* BERT embedding
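As a worked example of the weighting described above, the sketch below combines a log-scaled term frequency with the inverse document frequency. The helper names `log_tf`, `idf` and `tf_idf` are chosen for this illustration only; the collection size is the sum of the category counts printed by the loader.

```python
import math

def log_tf(tf: int) -> float:
    """Log-scaled term frequency: 1 + log10(tf), or 0 if the term is absent."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def idf(N: int, df: int) -> float:
    """Inverse document frequency over a collection of N documents."""
    return math.log10(N / df)

def tf_idf(tf: int, df: int, N: int) -> float:
    return log_tf(tf) * idf(N, df)

# collection size: sum of the per-category article counts reported by read_files
N = 401 + 386 + 511 + 510 + 417

rare = tf_idf(tf=1, df=1, N=N)    # term occurring once, in a single article
common = tf_idf(tf=1, df=N, N=N)  # term occurring in every article
print(N, round(rare, 4), common)
```

A term that occurs in every article gets weight 0 regardless of its frequency, while a term unique to one article gets the maximum idf of log10(N).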

In [9]:
'''
summarization(d,p,l,o,I,args)
    @input document d, maximum number of sentences (p) and/or characters (l), order
    of presentation o (appearance in text vs relevance), inverted index I or the
    collection D, and optional arguments on IR models

    @behavior preprocesses d, assesses the relevance of each sentence in d against I
    according to args, and presents them in accordance with p, l and o
    
    @output summary s of document d, i.e. ordered pairs (sentence position in d, score)
'''

def summarization(article: str, num_sent: int, order: str, inverted_index: dict, N: int, article_id: int, args=None) -> dict:
    # note: `order` is accepted but not used yet (see TODO above); results are returned by relevance
    # tokenizer splits words and keeps hyphens, e.g. state-of-the-art
    tokenizer = RegexpTokenizer(r'[\w|-]+')
    # tokenize sentences
    sents = nltk.sent_tokenize(article)
    # remove title (not needed for summarization task)
    sents = sents[1:]
    
    # get words per sentence 
    words_per_sent = [set(tokenizer.tokenize(sent.lower())) for sent in sents]

    # create sentence vector
    sent_vecs = list()
    for sent_id, words in enumerate(words_per_sent):
        # term frequency (tf) by sentence
        # note: `words` is a set here, so every count is 1 and the log-scaled tf below is 1 as well
        sent_tf = dict(Counter(words))
        for word in sent_tf: 
            sent_tf[word] = 1 + math.log10(sent_tf[word]) 

        # create dense sentence vectors
        # no need for idf (see lecture)
        sent_vec = dict.fromkeys(inverted_index.keys(), 0)
        sent_vec.update(sent_tf)
        sent_vec = list(sent_vec.values())
        sent_vecs.append(sent_vec)

    # inverse document frequency (idf) and term frequency (tf) per document 
    words_per_doc = set(flatten(words_per_sent))
    doc_tf_idf = dict()
    for word in words_per_doc: 
        # inverse document frequency (idf)
        idf = math.log10(N/len(inverted_index[word]))
        # term frequency (tf) by document
        # get it from inverted index 
        doc_tf = inverted_index[word][article_id]
        doc_tf = 1 + math.log10(doc_tf)

        # tf-idf for the document 
        doc_tf_idf[word] = doc_tf * idf 

    # create document vector
    doc_vec = dict.fromkeys(inverted_index.keys(), 0)
    doc_vec.update(doc_tf_idf)
    doc_vec = list(doc_vec.values())
    
    # cosine similarity 
    similarities = dict()
    for sent_id, sent_vec in enumerate(sent_vecs): 
        similarity = sklearn.metrics.pairwise_distances([sent_vec, doc_vec], metric="cosine")
        similarities[sent_id] = 1-similarity[0][1]
    similarities = dict(sorted(similarities.items(), key=lambda item: item[1], reverse=True)[:num_sent])

    for sent_id in similarities: 
        print(f"{similarities[sent_id]}: {sents[sent_id]}")
    return similarities  

0.2495193303958334: By far the most satisfying element of the game is seeing old favourites like Dr No, Goldfinger, hat-fiend Oddjob and crazed Russian sex beast Xenia Onatopp resurrected after all these years, and with their faces rendered in an impressively recognisable fashion.
0.21019493577531412: Rogue Agent signals its intentions by featuring James Bond initially and proceeding to kill him off within moments, squashed by a plummeting helicopter.
0.20079836196045564: You are the eponymous renegade spy, an agent who deserts to the Bond world's extensive ranks of criminal masterminds, after being deemed too brutal for MI6.
0.1995382299926325: With a determination to justify its name which is even less convincing than that of Tina Turner's similarly-titled theme song, the game literally gives the player a golden eye following an injury, which enables a degree of X-ray vision.
0.19033491063190966: Recent Bond games like Nightfire and Everything Or Nothing were very competent and did a

{19: 0.2495193303958334,
 5: 0.21019493577531412,
 2: 0.20079836196045564,
 4: 0.1995382299926325,
 7: 0.19033491063190966}
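One observation on the positions above: index 2 maps to "You are the eponymous renegade spy...", which is the fourth sentence of the printed article body. This suggests that the sentence tokenizer attaches the title (which has no final period) to the first sentence, so `sents[1:]` appears to drop the first sentence together with the title. A minimal sketch of one way to separate the title beforehand, assuming the title is followed by a blank line as in the printed sample (the helper name `split_title_and_body` and the shortened sample string are hypothetical):

```python
def split_title_and_body(article: str):
    """Split an article into (title, body) at the first blank line."""
    title, _, body = article.partition("\n\n")
    return title.strip(), body.strip()

# shortened sample in the same layout as the printed article
sample = (
    "Bond game fails to shake or stir\n\n"
    "For gaming fans, the word GoldenEye evokes excited memories.\n\n"
    "Adopting that hallowed title somewhat backfires on this new game."
)
title, body = split_title_and_body(sample)
print(title)
print(body[:15])
```

Sentence tokenization would then run on `body` only, so no slicing is needed to remove the title.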

# Keyword Extraction

Calculates the keywords based on the tf-idf weights of the document's terms.\
\
Additional parameters "N" and "article_id". Is this allowed?

Optional parameter `use_only_nouns` for restricting the candidates to nouns.

TODO:
* Nouns: just unigrams or also bigrams?
* BM25
* BERT embedding


In [10]:
from nltk.classify import Senna
from nltk.tag import SennaChunkTagger

In [11]:
'''
keyword extraction(d,p,I,args)
    @input document d, maximum number of keywords p, inverted index I, and
    optional arguments on IR model choices

    @behavior extracts the top informative p keywords in d against I according to args
    
    @output ordered set of p keywords
'''


In [42]:
def keyword_extraction(article: str, max_num: int, inverted_index: dict, N: int, article_id: int, use_only_nouns=False, args=None) -> dict: 
    # tokenizer splits words and keeps hyphens, e.g. state-of-the-art
    tokenizer = RegexpTokenizer(r'[\w|-]+')
    # tokenize sentences
    sents = nltk.sent_tokenize(article)
    # remove title (not needed for summarization task)
    sents = sents[1:]
    
    # get words per document
    # either use all words or only nouns
    if use_only_nouns: 
        #tagger = SennaChunkTagger('/usr/share/senna-v3.0/senna')
        words_per_doc = list()
        for sent in sents: 
            words = tokenizer.tokenize(sent.lower())
            
            # senna chunk tagging of nouns 
            # not sure if needed, just a test
            #tags = tagger.tag(words)
            #chunks = list(tagger.bio_to_chunks(tags, chunk_type='N'))

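            # nltk POS tagging; keep tokens tagged as nouns (NN, NNS, NNP, NNPS)
            # (iterating over the ne_chunk tree is fragile: entity subtrees are not (word, tag) pairs)
            tagged_words = nltk.pos_tag(words)
            for word, tag in tagged_words:
                if tag.startswith('NN'):
                    words_per_doc.append(word)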
    else: 
        words_per_sent = [set(tokenizer.tokenize(sent.lower())) for sent in sents]
        words_per_doc = flatten(words_per_sent)

    doc_tf_idf = dict()
    for word in words_per_doc: 
        # inverse document frequency (idf)
        idf = math.log10(N/len(inverted_index[word]))
        # term frequency (tf) by document
        # get it from inverted index 
        doc_tf = inverted_index[word][article_id]
        doc_tf = 1 + math.log10(doc_tf)

        # tf-idf for the document 
        doc_tf_idf[word] = doc_tf * idf
    
    doc_tf_idf = dict(sorted(doc_tf_idf.items(), key=lambda item: item[1], reverse=True)[:max_num])
    for word in doc_tf_idf: 
        print(f"{word}: {doc_tf_idf[word]}")
    return doc_tf_idf

keyword_extraction(articles[0], 10, inverted_index, len(articles), 0, use_only_nouns=True)
        


goldfinger: 4.354976755313726
goldeneye: 3.7342276913546106
bond: 3.6992505932864606
masterminds: 3.3473300153169503
commander-in-chief: 3.3473300153169503
juices: 3.3473300153169503
nightfire: 3.3473300153169503
aura: 3.3473300153169503
crosshair: 3.3473300153169503
firefights: 3.3473300153169503
{'goldfinger': 4.354976755313726, 'goldeneye': 3.7342276913546106, 'bond': 3.6992505932864606, 'masterminds': 3.3473300153169503, 'commander-in-chief': 3.3473300153169503, 'juices': 3.3473300153169503, 'nightfire': 3.3473300153169503, 'aura': 3.3473300153169503, 'crosshair': 3.3473300153169503, 'firefights': 3.3473300153169503}


# Evaluation

TODO:
* Implement evaluation
* Evaluation:
    * Statistics 
    * F-measure
    * Recall-precision-curve
    * MAP
    * Efficiency

In [13]:
'''
evaluation(Sset,Rset,args)
    @input the set of summaries Sset produced from selected documents Dset ⊆ D
    (e.g. a single document, a category of documents, the whole collection),
    the corresponding reference extracts Rset, and optional arguments
    (evaluation, preprocessing, model options)

    @behavior assesses the produced summaries against the reference ones using the
    target evaluation criteria

    @output evaluation statistics, including F-measuring at predefined p-or-l summary
    limits, recall-and-precision curves, MAP, and efficiency
'''


In [14]:
def evaluation(S: list, R: list, args=None) -> list:
    # do evaluation... 
    pass
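Until the evaluation is implemented, the sketch below shows how the listed measures could be computed for a single summary, assuming that both the summary and the reference extract are given as sentence positions. The function names and the reference positions are hypothetical; MAP would be the mean of `average_precision` over all evaluated documents.

```python
def precision_recall_f(summary_ids, reference_ids, beta=1.0):
    """Set-based precision, recall and F-measure for one summary."""
    summary, reference = set(summary_ids), set(reference_ids)
    hits = len(summary & reference)
    precision = hits / len(summary) if summary else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

def average_precision(ranked_ids, reference_ids):
    """Average precision of a ranked list of sentence positions."""
    reference = set(reference_ids)
    hits, total = 0, 0.0
    for rank, sent_id in enumerate(ranked_ids, start=1):
        if sent_id in reference:
            hits += 1
            total += hits / rank  # precision at this relevant rank
    return total / len(reference) if reference else 0.0

ranked = [19, 5, 2, 4, 7]   # summary positions, ordered by relevance
reference = [5, 4, 11]      # hypothetical reference extract
p, r, f1 = precision_recall_f(ranked, reference)
ap = average_precision(ranked, reference)
print(p, r, f1, ap)
```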

B) **Summarization**

*B.1 Summarization solution: results for a given document*

In [45]:
#code, statistics and/or charts here
article_id = 0
print(articles[article_id])
summarization(articles[article_id], num_sent=5, order="rel", inverted_index=inverted_index, N=len(articles), article_id=article_id) 

Bond game fails to shake or stir

For gaming fans, the word GoldenEye evokes excited memories not only of the James Bond revival flick of 1995, but also the classic shoot-em-up that accompanied it and left N64 owners glued to their consoles for many an hour.

Adopting that hallowed title somewhat backfires on this new game, for it fails to deliver on the promise of its name and struggles to generate the original's massive sense of fun. This however is not a sequel, nor does it relate to the GoldenEye film. You are the eponymous renegade spy, an agent who deserts to the Bond world's extensive ranks of criminal masterminds, after being deemed too brutal for MI6. Your new commander-in-chief is the portly Auric Goldfinger, last seen in 1964, but happily running around bent on world domination. With a determination to justify its name which is even less convincing than that of Tina Turner's similarly-titled theme song, the game literally gives the player a golden eye following an injury, wh

{19: 0.2495193303958334,
 5: 0.21019493577531412,
 2: 0.20079836196045564,
 4: 0.1995382299926325,
 7: 0.19033491063190966}
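The call above passes `order="rel"`, and the scores come back sorted by relevance. A minimal sketch of how the order option could be applied to the returned scores, assuming a second value `"app"` for order of appearance (that value and the helper name `order_summary` are assumptions of this example):

```python
def order_summary(scores: dict, order: str = "rel") -> list:
    """Return (sentence position, score) pairs in the requested order."""
    if order == "rel":
        # highest score first
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if order == "app":
        # order of appearance in the text
        return sorted(scores.items())
    raise ValueError(f"unknown order: {order!r}")

scores = {19: 0.2495, 5: 0.2102, 2: 0.2008, 4: 0.1995, 7: 0.1903}  # rounded values from above
by_relevance = order_summary(scores, "rel")
by_appearance = order_summary(scores, "app")
print([pos for pos, _ in by_relevance])
print([pos for pos, _ in by_appearance])
```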

*B.2 IR models (TF-IDF, BM25 and BERT)*

In [16]:
#code, statistics and/or charts here
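For the BM25 option, a minimal sketch of the scoring function is given below. It uses one common non-negative idf variant and the usual defaults `k1=1.2` and `b=0.75`; how sentences and the document map onto "document" and "query" is a design choice that is only assumed here, and all names and toy values are hypothetical.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, df, N, avgdl, k1=1.2, b=0.75):
    """BM25 score of one text unit (doc_terms) for a bag of query terms."""
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for term in set(query_terms):
        if tf[term] == 0 or term not in df:
            continue
        # idf variant that stays non-negative for very frequent terms
        idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
        # term frequency saturation with length normalisation
        norm = tf[term] + k1 * (1 - b + b * dl / avgdl)
        score += idf * tf[term] * (k1 + 1) / norm
    return score

df = {"bond": 5, "game": 50}   # toy document frequencies
N, avgdl = 100, 4              # toy collection size and average length
query = ["bond", "game"]
s1 = bm25_score(query, ["bond", "game", "fails", "bond"], df, N, avgdl)
s2 = bm25_score(query, ["game", "is", "fun", "today"], df, N, avgdl)
s3 = bm25_score(query, ["nothing", "here"], df, N, avgdl)
print(s1, s2, s3)
```

The sentence containing the rarer term twice scores highest, and a sentence without any query term scores 0.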

*B.3 Reciprocal rank fusion*

In [17]:
#code, statistics and/or charts here
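Reciprocal rank fusion combines several rankings by summing 1 / (k + rank) for each item over all rankings. The sketch below uses the commonly chosen constant `k=60`; the two input rankings are made-up sentence positions, not results from this notebook.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists; returns (item, fused score) pairs, best first."""
    fused = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            fused[item] = fused.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)

tfidf_ranking = [19, 5, 2]   # hypothetical ranking from one model
bm25_ranking = [5, 2, 19]    # hypothetical ranking from another model
fused = reciprocal_rank_fusion([tfidf_ranking, bm25_ranking])
print([item for item, _ in fused])
```

Only ranks enter the fusion, so the models' scores do not need to be on comparable scales.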

*B.4 Maximal Marginal Relevance*

In [18]:
#code, statistics and/or charts here
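Maximal Marginal Relevance selects sentences greedily, trading off relevance against similarity to the sentences already selected. A minimal sketch under the assumption that relevance scores and pairwise sentence similarities are already available; the function name, the parameter `lam` and the toy values are hypothetical.

```python
def mmr_select(relevance, similarity, num_sent, lam=0.7):
    """Greedy MMR: maximise lam * relevance - (1 - lam) * max similarity to selected."""
    selected = []
    candidates = set(relevance)
    while candidates and len(selected) < num_sent:
        def mmr(c):
            redundancy = max((similarity[c][s] for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * redundancy
        # sorted() keeps tie-breaking deterministic
        best = max(sorted(candidates), key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected

relevance = {0: 0.9, 1: 0.85, 2: 0.5}
similarity = {
    0: {1: 0.95, 2: 0.1},
    1: {0: 0.95, 2: 0.2},
    2: {0: 0.1, 1: 0.2},
}
diverse = mmr_select(relevance, similarity, num_sent=2, lam=0.5)
greedy = mmr_select(relevance, similarity, num_sent=2, lam=1.0)
print(diverse, greedy)
```

With `lam=1.0` the selection reduces to plain relevance ranking; lower values penalise sentence 1 here because it is nearly identical to sentence 0.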

C) **Keyword extraction**

In [46]:
#code, statistics and/or charts here
article_id = 0
print(articles[article_id])
keyword_extraction(articles[0], 10, inverted_index, len(articles), 0, use_only_nouns=True)

Bond game fails to shake or stir

For gaming fans, the word GoldenEye evokes excited memories not only of the James Bond revival flick of 1995, but also the classic shoot-em-up that accompanied it and left N64 owners glued to their consoles for many an hour.

Adopting that hallowed title somewhat backfires on this new game, for it fails to deliver on the promise of its name and struggles to generate the original's massive sense of fun. This however is not a sequel, nor does it relate to the GoldenEye film. You are the eponymous renegade spy, an agent who deserts to the Bond world's extensive ranks of criminal masterminds, after being deemed too brutal for MI6. Your new commander-in-chief is the portly Auric Goldfinger, last seen in 1964, but happily running around bent on world domination. With a determination to justify its name which is even less convincing than that of Tina Turner's similarly-titled theme song, the game literally gives the player a golden eye following an injury, wh

D) **Evaluation**

In [20]:
#code, statistics and/or charts here

<H3>Part II: questions materials (optional)</H3>

**(1)** Corpus *D* and summaries *S* description.

In [21]:
#code, statistics and/or charts here

**(2)** Summarization performance for the overall and category-conditional corpora.

In [22]:
#code, statistics and/or charts here

**...** (additional questions with empirical results)

<H3>END</H3>