<a href="https://colab.research.google.com/github/LukasEder1/ContrastiveKeywordExtraction/blob/main/demo/CKE-demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Demonstration: Contrastive Keyword Extraction from Versioned Documents

## Setup:

In [1]:
!pip install git+https://github.com/LukasEder1/ContrastiveKeywordExtraction

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/LukasEder1/ContrastiveKeywordExtraction
  Cloning https://github.com/LukasEder1/ContrastiveKeywordExtraction to /tmp/pip-req-build-4uqf25_w
  Running command git clone --filter=blob:none --quiet https://github.com/LukasEder1/ContrastiveKeywordExtraction /tmp/pip-req-build-4uqf25_w
  Resolved https://github.com/LukasEder1/ContrastiveKeywordExtraction to commit d621fcdf25e89c2e94d0a6de7ca19446c531af47
  Preparing metadata (setup.py) ... done
Collecting sentence-transformers (from cke==0.1)
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting cdifflib (from cke==0.1)
  Downloading cdifflib-1.2.6.tar.gz (11 kB)
  Installing build dependencies ...

In [2]:
!git clone https://github.com/LukasEder1/ContrastiveKeywordExtraction
%cd /content/ContrastiveKeywordExtraction/demo

Cloning into 'ContrastiveKeywordExtraction'...
remote: Enumerating objects: 152, done.
remote: Counting objects: 100% (152/152), done.
remote: Compressing objects: 100% (81/81), done.
remote: Total 152 (delta 68), reused 144 (delta 64), pack-reused 0
Receiving objects: 100% (152/152), 142.08 KiB | 1.43 MiB/s, done.
Resolving deltas: 100% (68/68), done.
/content/ContrastiveKeywordExtraction/demo


In [3]:
import pickle

from cke import extract_contrastive_keywords

import string
from cke.sentence_comparision import match_sentences_semantic_search, match_sentences_tfidf_weighted, detect_changes
from cke.sentence_importance import text_rank_importance, yake_weighted_importance
import nltk
import pandas as pd


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [4]:
with open("docs.pkl", "rb") as file:
  documents = pickle.load(file)

In [5]:
stopwords = nltk.corpus.stopwords.words("english")

## Contrastive Keyword Extraction Pipeline
This notebook takes a closer look at every step of the pipeline depicted below. If you are only interested in extracting contrastive keywords for the preset or your own versioned documents, skip everything but the last section.

A GUI is available [here](https://contrastive-keyword-extraction.streamlit.app/).



<img src="https://github.com/LukasEder1/CKE_streamlit/blob/main/revamped.png?raw=true" alt="pipeline" />




### Document Selection
A small sample of versioned documents is provided; the next cell prints the keys under which they are stored.

You can also supply your own data as a list with two entries.

In [6]:
documents.keys()

dict_keys([17313, 16159, 17736, 17748, 3299, 90232, 98445, 98447, 106601, 106604, 99880, 0, 1])

In [7]:
documents[0]

['In this paper, we introduce TextRank - a graph-based ranking model for text processing, and show how this model can be successfully used in natural language applications. In particular, we propose two innovative unsupervised methods for keyword and sentence extraction, and show that the results obtained compare favorably with previously published results on established benchmark.',
 'TextRank, a graph-based ranking system, is introduced in this paper. Ranking model for text processing, and demonstrate how this model can be used successfully in natural language processing applications. We propose two novel unsupervised methods for keyword and sentence extraction in particular, and demonstrate that the results obtained compare favorably with previously published results on established benchmarks.']

To use your own data, replace the assignment below with your own list: `[older_version, newer_version]`

In [8]:
versioned_document = documents[17313]
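
For custom data, the two-entry format could look like the following minimal sketch. The two texts and the name `custom_document` are made up for illustration and are not part of the provided sample.

```python
# Hypothetical example texts; any two versions of the same document would do.
older_version = (
    "The committee approved the budget on Monday. "
    "The vote was unanimous."
)
newer_version = (
    "The committee approved the revised budget on Tuesday. "
    "The vote was unanimous. A second reading is planned."
)

# Older version first, newer version second.
custom_document = [older_version, newer_version]
print(len(custom_document))
```

Assigning `custom_document` to `versioned_document` should then run the remaining cells on this pair instead of a preset one.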

### Sentence Matching

Sentence matching maps each sentence of the former version to its counterpart(s) in the latter version, which shows whether the overall sentence structure changed between the two versions. It is especially useful for classifying sentences as new, removed, unchanged, or changed.
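
The classification idea can be sketched in a few lines. This is only an illustration: it uses the character-level ratio from Python's `difflib` as a stand-in similarity, whereas the notebook matches sentences with embeddings or TF-IDF. The function name `classify_sentences` and the toy sentences are hypothetical.

```python
from difflib import SequenceMatcher


def classify_sentences(older, newer, threshold=0.6):
    """Classify sentences as removed, new, unchanged or changed.

    Stand-in similarity: difflib's character-level ratio (an assumption,
    not the matching method used by the cke package).
    """
    matches = {}  # index in older -> index of its best match in newer
    for i, source in enumerate(older):
        scores = [SequenceMatcher(None, source, target).ratio() for target in newer]
        best = max(range(len(newer)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= threshold:
            matches[i] = best

    matched_targets = set(matches.values())
    return {
        "removed": [i for i in range(len(older)) if i not in matches],
        "new": [j for j in range(len(newer)) if j not in matched_targets],
        "unchanged": [i for i, j in matches.items() if older[i] == newer[j]],
        "changed": [i for i, j in matches.items() if older[i] != newer[j]],
    }


toy_older = ["The cat sat on the mat.", "It rained all day.", "Nobody called."]
toy_newer = ["The cat sat on the red mat.", "It rained all day.", "Dinner was served at eight."]
print(classify_sentences(toy_older, toy_newer))
```

In this toy pair the first sentence counts as changed, the second as unchanged, the third older sentence as removed, and the third newer sentence as new.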

#### Auxiliary Functions

In [9]:
def get_matched_indices(matched_dict):
    """Get the indices of matched sentences.

    Args:
        matched_dict (dict): Keys: sentence indices in document A.
                             Values: list of pairs (index in document B, semantic similarity).

    Returns:
        List of all sentence indices in document A that have at least one match in document B.
    """
    return [i for i in list(matched_dict.keys()) if len(matched_dict[i]) > 0]

In [10]:
def display_matches(matched_dict):
    """Return a DataFrame with one row per (source, matched) sentence pair."""
    original_indices = []
    matched_indices = []
    matched_score = []

    for i in get_matched_indices(matched_dict):
        # Repeat the source index once per match so that all columns stay aligned.
        original_indices += len(matched_dict[i]) * [i]
        for idx, score in matched_dict[i]:
            matched_indices.append(int(idx))
            matched_score.append(float(score))

    return pd.DataFrame({"source sentence position": original_indices,
        "matched sentence position": matched_indices,
        "semantic similarity": matched_score}).reset_index(drop=True)

#### Semantic Search:


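* threshold: minimum similarity two sentences need in order to count as a match
* k: maximum number of sentences a single sentence can split into
* model: name of the model used for matching

Returns:
* matched_dict (dict): key: sentence index in the older version, value: the matches for that sentence as (index in the newer version, similarity) pairs
* removed (list): indices of sentences in the older version that were not matched (reported as deleted in the classification further down)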
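
How `threshold` and `k` interact can be illustrated with a thresholded top-k search over cosine similarities. The sketch below works on hand-made 2-d vectors rather than real sentence embeddings, and `top_k_matches` is a hypothetical helper, not the implementation of `match_sentences_semantic_search`.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_matches(older_vectors, newer_vectors, threshold=0.6, k=1):
    """For each older sentence, keep up to k newer sentences at or above the threshold."""
    matched = {}
    for i, source in enumerate(older_vectors):
        scored = [(j, cosine(source, target)) for j, target in enumerate(newer_vectors)]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        matched[i] = [(j, score) for j, score in scored[:k] if score >= threshold]
    return matched


# Toy 2-d "embeddings"; real ones would come from the chosen model.
toy_older_vectors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
toy_newer_vectors = [[1.0, 0.1], [0.1, 1.0]]

toy_matched = top_k_matches(toy_older_vectors, toy_newer_vectors, threshold=0.6, k=1)
# Older sentences without any match above the threshold.
toy_removed = [i for i, pairs in toy_matched.items() if not pairs]
print(toy_matched)
print(toy_removed)
```

Raising `k` would let one older sentence keep several matches, which is how a sentence split could be represented; raising `threshold` would move more sentences into the unmatched list.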



In [11]:
matched_dict, removed = match_sentences_semantic_search(
    document_a=versioned_document[0],
    document_b=versioned_document[1],
    threshold=0.6,
    k=1,
    model="all-MiniLM-L6-v2",
)

In [12]:
display_matches(matched_dict)

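|   | source sentence position | matched sentence position | semantic similarity |
|---|---|---|---|
| 0 | 0 | 0 | 1.0 |
| 1 | 1 | 1 | 0.655811 |
| 2 | 2 | 1 | 0.893363 |
| 3 | 3 | 2 | 1.0 |
| 4 | 4 | 3 | 0.98448 |
| 5 | 5 | 4 | 0.993359 |
| 6 | 6 | 5 | 0.998991 |
| 7 | 7 | 9 | 0.942158 |
| 8 | 8 | 10 | 1.0 |
| 9 | 9 | 11 | 1.0 |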


### Change Detection

After matching all sentences, we now extract all additions/deletions between matched sentence pairs.

The returned variables are:
* changed_sentences (list): indices (in the older version) of all sentences in which some change occurred (including punctuation differences)
* new_sentences (list): all sentences of the newer version that no older sentence was matched to
* additions (dict): key: source index, value: dictionary of additions for all matched sentences (can contain multiple values if num_splits > 1)
* deletions (dict): analogous to additions
* matched_indices: all indices (in the latter version) that were matched to (can contain duplicates if multiple sentences were merged into one)
* unified_delitions: if k == 1, the same as deletions; if k > 1, the union of all deletions over the sentences that a sentence split into
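
Extracting additions and deletions from a matched pair can be sketched with a word-level diff. The example below relies on the standard library's `difflib.SequenceMatcher`; the install log above lists `cdifflib` as a dependency, so a difflib-style diff seems a plausible stand-in, but that is an assumption and `diff_words` is a hypothetical helper, not `detect_changes`.

```python
from difflib import SequenceMatcher


def diff_words(older_sentence, newer_sentence):
    """Return (additions, deletions) as lists of whitespace-separated words."""
    old_words = older_sentence.split()
    new_words = newer_sentence.split()
    additions, deletions = [], []
    matcher = SequenceMatcher(None, old_words, new_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            deletions.extend(old_words[i1:i2])
        if tag in ("replace", "insert"):
            additions.extend(new_words[j1:j2])
    return additions, deletions


toy_added, toy_deleted = diff_words(
    "The committee approved the budget on Monday.",
    "The committee approved the revised budget on Tuesday.",
)
print(toy_added)
print(toy_deleted)
```

Here `revised` and `Tuesday.` end up as additions and `Monday.` as a deletion. The sketch works on single words only and does not cover the n-gram handling controlled by `max_ngram`.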

In [13]:
changed_sentences, new_sentences, additions, deletions, matched_indices, unified_delitions = detect_changes(
    matched_dict,
    versioned_document[0],
    versioned_document[1],
    max_ngram=2,
    show_output=True,
)

query: WASHINGTON (AP) -- Federal agents who raided the office of President Donald Trump's personal attorney, Michael Cohen, were looking for information about payments to a former Playboy playmate and a porn actress who claim to have had affairs with Trump, two people familiar with the investigation said.
 
matched: WASHINGTON (AP) -- Federal agents who raided the office of President Donald Trump's personal attorney, Michael Cohen, were looking for information about payments to a former Playboy Playmate and a porn actress who claim to have had affairs with Trump, two people familiar with the investigation said.
 
Semantic Resemblence: 1.0000
Syntactic Resemblence: 0.9371

added in newer version:[]
deleted from older version: []
------------------------------------------------------------------------------

query: Public corruption prosecutors in the U.S.

matched: Public corruption prosecutors in the U.S. attorney's office in Manhattan are trying to determine if there was any fraud re

In [14]:
print("Classification of All Sentences:")
print(f"New (Index in Newer Version): {new_sentences}")
print(f"Deleted (Index in Older Version): {removed}")
print(f"Changed (Index in Older Version): {changed_sentences}")
print(f"Matched (Index in Newer Version): {list(set(matched_indices))}")

Classification of All Sentences:
New (Index in Newer Version): [34, 37, 6, 7, 8, 38, 39, 40, 41, 42, 43]
Deleted (Index in Older Version): [14, 15, 16, 17, 27, 36]
Changed (Index in Older Version): [0, 1, 2, 3, 4, 5, 6, 7, 13, 18, 22, 25, 29, 31, 35, 37, 42]
Matched (Index in Newer Version): [0, 1, 2, 3, 4, 5, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 35, 36, 44, 45]


### Sentence Importance

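This step computes the importance of each sentence within its respective version. The scores are used later to rank keywords.

Returns:

* ranking: one dict per version (`ranking[0]` for the older version, `ranking[1]` for the newer one); key: sentence position, value: importance score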
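
To give an intuition for graph-based sentence importance, here is a minimal PageRank-style power iteration in the spirit of TextRank. It is a sketch under stated assumptions: edge weights are plain Jaccard word overlaps, and `sentence_importance` is a hypothetical function, not the code behind `text_rank_importance` or `yake_weighted_importance`.

```python
def sentence_importance(sentences, damping=0.85, iterations=50):
    """Score sentences by a PageRank-style iteration over a word-overlap graph."""
    n = len(sentences)
    if n == 0:
        return {}
    words = [set(sentence.lower().split()) for sentence in sentences]

    def overlap(a, b):
        union = words[a] | words[b]
        return len(words[a] & words[b]) / len(union) if union else 0.0

    # Symmetric edge weights; no self-loops.
    weights = [[overlap(a, b) if a != b else 0.0 for b in range(n)] for a in range(n)]
    out_sums = [sum(row) for row in weights]

    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                scores[b] * weights[b][a] / out_sums[b]
                for b in range(n)
                if out_sums[b] > 0
            )
            for a in range(n)
        ]

    # Normalise so the scores sum to 1, then sort by descending importance.
    total = sum(scores)
    ranking = {i: score / total for i, score in enumerate(scores)}
    return dict(sorted(ranking.items(), key=lambda item: item[1], reverse=True))


toy_sentences = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "stocks fell sharply today",
]
toy_ranking = sentence_importance(toy_sentences)
print(toy_ranking)
```

The two sentences that share words reinforce each other and rank above the unrelated third one.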

In [15]:
ranking = yake_weighted_importance(versioned_document)

In [16]:
from IPython.display import display

def display_importance(ranking, k):
  """Show the first k rows of the importance ranking for each version."""

  df1 = pd.DataFrame({"Position": ranking[0].keys(), "Importance": ranking[0].values()}).reset_index(drop=True)
  df2 = pd.DataFrame({"Position": ranking[1].keys(), "Importance": ranking[1].values()}).reset_index(drop=True)

  print("Importance Older Version")
  display(df1.head(k))

  print("\nImportance Newer Version")
  display(df2.head(k))

In [17]:
display_importance(ranking, k=10)

Importance Older Version


|   | Position | Importance |
|---|---|---|
| 0 | 27 | 0.110109 |
| 1 | 0 | 0.092421 |
| 2 | 13 | 0.086153 |
| 3 | 31 | 0.082445 |
| 4 | 8 | 0.078697 |
| 5 | 1 | 0.063164 |
| 6 | 34 | 0.057881 |
| 7 | 21 | 0.047157 |
| 8 | 20 | 0.041676 |
| 9 | 33 | 0.039618 |



Importance Newer Version


|   | Position | Importance |
|---|---|---|
| 0 | 36 | 0.121273 |
| 1 | 28 | 0.107062 |
| 2 | 0 | 0.076276 |
| 3 | 32 | 0.071495 |
| 4 | 24 | 0.069186 |
| 5 | 11 | 0.051056 |
| 6 | 21 | 0.050195 |
| 7 | 15 | 0.046104 |
| 8 | 3 | 0.040298 |
| 9 | 26 | 0.037466 |


### Extract Contrastive Keywords

We have now come to the main focus of this demo: extracting keywords that encapsulate the differences between the two document versions.

The keywords are split into 3 sets:
* Former Keywords: keywords regarding the older version
* Latter Keywords: keywords regarding the newer version
* Combined Keywords: a combination of the two sets above
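
One way sentence importance could feed into keyword ranking is to let each changed term inherit the importance of the sentences it appears in. The sketch below illustrates that idea only; the input shapes, the normalisation, and the name `score_keywords` are assumptions and may differ from what `extract_contrastive_keywords` does internally.

```python
from collections import defaultdict


def score_keywords(additions, importance, num_keywords=10):
    """Rank added terms by the summed importance of their sentences.

    additions:  {sentence index: list of added terms}  (hypothetical shape)
    importance: {sentence index: importance score}
    """
    scores = defaultdict(float)
    for index, terms in additions.items():
        for term in terms:
            scores[term] += importance.get(index, 0.0)

    total = sum(scores.values())
    if total == 0:
        return {}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Normalise over all candidates, then keep the best num_keywords.
    return {term: score / total for term, score in ranked[:num_keywords]}


toy_additions = {0: ["revised budget"], 2: ["second reading", "revised budget"]}
toy_importance = {0: 0.5, 1: 0.3, 2: 0.2}
toy_keywords = score_keywords(toy_additions, toy_importance)
print(toy_keywords)
```

In the toy data, `revised budget` occurs in two sentences and therefore outranks `second reading`.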

In [20]:
threshold = 0.6
# Choose any model: https://www.sbert.net/examples/applications/semantic-search/README.html
model = 'all-MiniLM-L6-v2'
num_splits = 1
max_ngram = 2


combined_kws, former_kws, latter_kws = extract_contrastive_keywords(
    versioned_document[0],
    versioned_document[1],
    max_ngram=max_ngram,  # Maximum n-gram size of keywords
    min_ngram=1,
    extra_stopwords=stopwords,  # Remove English stopwords ([] = do not consider any stopwords)
    importance_estimator=text_rank_importance,  # alt: yake_weighted_importance
    match_sentences=match_sentences_semantic_search,  # alt: match_sentences_tfidf_weighted
    threshold=threshold,  # Matching threshold
    symbols_to_remove=string.punctuation,  # Remove certain symbols
    matching_model=model,  # Matching model: only relevant for semantic search
    num_splits=num_splits,  # Max number of sentences a sentence can possibly split into
    num_keywords=10,
)
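
The effect of `min_ngram`, `max_ngram`, `extra_stopwords` and `symbols_to_remove` on keyword candidates can be illustrated with a small n-gram generator. This is a sketch, not the library's code: `candidate_ngrams` and the example sentence are hypothetical, and whether the package drops stopwords before or after forming n-grams is not confirmed here.

```python
import string


def candidate_ngrams(text, min_ngram=1, max_ngram=2, stopwords=(),
                     symbols_to_remove=string.punctuation):
    """List lower-cased n-grams after stripping symbols and stopwords."""
    # Delete every symbol outright, so "attorney-client" becomes "attorneyclient".
    cleaned = text.lower().translate(str.maketrans("", "", symbols_to_remove))
    stop = set(stopwords)
    tokens = [token for token in cleaned.split() if token not in stop]
    return [
        " ".join(tokens[start:start + size])
        for size in range(min_ngram, max_ngram + 1)
        for start in range(len(tokens) - size + 1)
    ]


toy_candidates = candidate_ngrams(
    "The FBI agents cited attorney-client privilege.",
    stopwords=["the"],
)
print(toy_candidates)
```

Deleting punctuation without inserting a space would explain keywords such as `attorneyclient privilege` in the results that follow.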

#### Inspect the Keywords


In [23]:
def display_keywords(keywords):
  display(pd.DataFrame({"Keyword": keywords.keys(), "Score": keywords.values()}))

In [24]:
display_keywords(former_kws)

|   | Keyword | Score |
|---|---|---|
| 0 | attorneyclient privilege | 0.146228 |
| 1 | fbi agents | 0.113967 |
| 2 | fire mueller | 0.107559 |
| 3 | dead | 0.090321 |
| 4 | furious president | 0.090321 |
| 5 | president blasted | 0.090321 |
| 6 | blasted displeasure | 0.090321 |
| 7 | displeasure early | 0.090321 |
| 8 | early tuesday | 0.090321 |
| 9 | tuesday saying | 0.090321 |


In [25]:
display_keywords(latter_kws)

|   | Keyword | Score |
|---|---|---|
| 0 | state transportation | 0.164149 |
| 1 | transportation taxes | 0.164149 |
| 2 | new york | 0.120413 |
| 3 | records show | 0.111552 |
| 4 | york city | 0.100616 |
| 5 | city yellow | 0.100616 |
| 6 | taxes | 0.079266 |
| 7 | also sought | 0.064811 |
| 8 | medallions | 0.048428 |
| 9 | pleaded guilty | 0.046 |


In [26]:
display_keywords(combined_kws)

|   | Keyword | Score |
|---|---|---|
| 0 | state transportation | 0.162182 |
| 1 | transportation taxes | 0.162182 |
| 2 | new york | 0.119013 |
| 3 | records show | 0.110215 |
| 4 | york city | 0.09941 |
| 5 | city yellow | 0.09941 |
| 6 | taxes | 0.078316 |
| 7 | also sought | 0.064034 |
| 8 | attorneyclient privilege | 0.057391 |
| 9 | medallions | 0.047848 |
