# Recursive Clustering and Summarization

Plan:
- recursively cluster collections
- create tree of clusters (the HDBSCAN does this anyways but likely not as we want)
- cluster until max-depth is reached or (better) until each leaf only has one "plausible" cluster (based on thresholds or probabilities)
- try summarizing to get "main idea" out of cluster


## Recursively cluster 

Based on the topic_clustering notebook, we will try with Agglomerative Clustering 

In [3]:
import pandas as pd

df = pd.read_csv("downloads/40k_balanced_pm_acl.csv")#.sample(frac=0.5)

In [10]:
pos = df[df.labels == 1]
sentences = list(pos["text"]) #otherwise key error

In [11]:
from sentence_transformers import SentenceTransformer, util

print("Encode the corpus ... get a coffee in the meantime")
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
embeddings = model.encode(sentences, batch_size=64, show_progress_bar=True, convert_to_tensor=True)

Encode the corpus ... get a coffee in the meantime


Batches:   0%|          | 0/342 [00:00<?, ?it/s]

In [50]:
from sklearn.cluster import AgglomerativeClustering
import numpy as np

def cluster(embeddings, **kwargs):
    embeddings = embeddings.cpu()
    # Normalize the embeddings to unit length
    corpus_embeddings = embeddings /  np.linalg.norm(embeddings, axis=1, keepdims=True)

    # Perform kmean clustering
    clustering_model = AgglomerativeClustering(**kwargs) #, affinity='cosine', linkage='average', distance_threshold=0.4)
    clustering_model.fit(corpus_embeddings)
   # cluster_assignment = clustering_model.labels_
    return clustering_model

In [51]:
def get_clusters(clustering_model):
    
    clusters = {}
    for sentence_id, cluster_id in enumerate(clustering_model.labels_):
        if cluster_id not in clusters:
            clusters[cluster_id] = []
        clusters[cluster_id].append(sentences[sentence_id])
    return clusters
   
#     for i, cluster in clustered_sentences.items():
#         print("Cluster ", i+1)
#         print(cluster)
#         print("\n")

In [52]:
sample = embeddings[:5000]

In [54]:
cluster_model = cluster(sample,n_clusters=None, distance_threshold=1.2)

Cluster Model attributes

   n_clusters_ : int
        The number of clusters found by the algorithm. If
        ``distance_threshold=None``, it will be equal to the given
        ``n_clusters``.

    labels_
    n_leaves_

    n_connected_components_ : The estimated number of connected components in the graph.

    children_ : array-like of shape (n_samples-1, 2)
        The children of each non-leaf node. Values less than `n_samples`
        correspond to leaves of the tree which are the original samples.
        A node `i` greater than or equal to `n_samples` is a non-leaf
        node and has children `children_[i - n_samples]`. Alternatively
        at the i-th iteration, children[i][0] and children[i][1]
        are merged to form node `n_samples + i`

    distances_ : array-like of shape (n_nodes-1,)
        Distances between nodes in the corresponding place in `children_`.
        Only computed if `distance_threshold` is used or `compute_distances`
        is set to `True`.

In [55]:
cluster_model.children_

array([[ 175,  243],
       [2079, 2652],
       [4746, 4862],
       ...,
       [9987, 9992],
       [9995, 9996],
       [9994, 9997]])

In [27]:
import itertools

ii = itertools.count(sample.shape[0])
[{'node_id': next(ii), 'left': x[0], 'right':x[1]} for x in cluster_model.children_]


[{'node_id': 1000, 'left': 175, 'right': 243},
 {'node_id': 1001, 'left': 506, 'right': 770},
 {'node_id': 1002, 'left': 41, 'right': 70},
 {'node_id': 1003, 'left': 599, 'right': 676},
 {'node_id': 1004, 'left': 425, 'right': 460},
 {'node_id': 1005, 'left': 556, 'right': 620},
 {'node_id': 1006, 'left': 110, 'right': 172},
 {'node_id': 1007, 'left': 616, 'right': 956},
 {'node_id': 1008, 'left': 644, 'right': 767},
 {'node_id': 1009, 'left': 835, 'right': 916},
 {'node_id': 1010, 'left': 868, 'right': 1008},
 {'node_id': 1011, 'left': 726, 'right': 727},
 {'node_id': 1012, 'left': 827, 'right': 936},
 {'node_id': 1013, 'left': 187, 'right': 189},
 {'node_id': 1014, 'left': 31, 'right': 436},
 {'node_id': 1015, 'left': 545, 'right': 575},
 {'node_id': 1016, 'left': 647, 'right': 659},
 {'node_id': 1017, 'left': 267, 'right': 403},
 {'node_id': 1018, 'left': 816, 'right': 944},
 {'node_id': 1019, 'left': 803, 'right': 838},
 {'node_id': 1020, 'left': 592, 'right': 924},
 {'node_id': 10

In [28]:
dict(enumerate(cluster_model.children_, model.n_leaves_))

{5: array([175, 243]),
 6: array([506, 770]),
 7: array([41, 70]),
 8: array([599, 676]),
 9: array([425, 460]),
 10: array([556, 620]),
 11: array([110, 172]),
 12: array([616, 956]),
 13: array([644, 767]),
 14: array([835, 916]),
 15: array([ 868, 1008]),
 16: array([726, 727]),
 17: array([827, 936]),
 18: array([187, 189]),
 19: array([ 31, 436]),
 20: array([545, 575]),
 21: array([647, 659]),
 22: array([267, 403]),
 23: array([816, 944]),
 24: array([803, 838]),
 25: array([592, 924]),
 26: array([351, 378]),
 27: array([624, 823]),
 28: array([517, 585]),
 29: array([ 571, 1019]),
 30: array([589, 852]),
 31: array([520, 700]),
 32: array([103, 157]),
 33: array([ 388, 1021]),
 34: array([801, 833]),
 35: array([927, 955]),
 36: array([744, 781]),
 37: array([900, 986]),
 38: array([794, 856]),
 39: array([684, 793]),
 40: array([ 13, 126]),
 41: array([558, 694]),
 42: array([ 65, 109]),
 43: array([202, 370]),
 44: array([ 469, 1017]),
 45: array([696, 743]),
 46: array([582

## Summarization

**Tried: Google Pegasus**. Result: Does a terrible job of keeping the important information and doesn't retain the question but guesses at a conclusion


In [68]:
torch.cuda.is_available()

True

### Pegasus Setup


In [64]:
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
import torch

model_name = 'google/pegasus-xsum'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)


In [66]:

def summarize(sentences):
    batch = tokenizer(sentences, truncation=True, padding='longest', return_tensors="pt").to(device)
    translated = model.generate(**batch)
    tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
    return tgt_text

In [62]:
tgt_text[0]

"California's largest electricity provider has turned off power to hundreds of thousands of customers."

### T5 Setup

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization")

ARTICLE = """ Background: Trust is a critical component of competency committees given their high-stakes decisions. Research from outside of medicine on group trust has not focused on trust in group decisions, and "group trust" has not been clearly defined. The purpose was twofold: to examine the definition of trust in the context of group decisions and to explore what factors may influence trust from the perspective of those who rely on competency committees through a proposed group trust model. Methods: The authors conducted a literature search of four online databases, seeking articles published on trust in group settings. Reviewers extracted, coded, and analyzed key data including definitions of trust and factors pertaining to group trust. Results: The authors selected 42 articles for full text review. Although reviewers found multiple general definitions of trust, they were unable to find a clear definition of group trust and propose the following: a group-directed willingness to accept vulnerability to actions of the members based on the expectation that members will perform a particular action important to the group, encompassing social exchange, collective perceptions, and interpersonal trust. Additionally, the authors propose a model encompassing individual level factors (trustor and trustee), interpersonal interactions, group level factors (structure and processes), and environmental factors."""
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("snrspeaks/t5-one-line-summary") #snrspeaks/t5-one-line-summary
tokenizer = AutoTokenizer.from_pretrained("snrspeaks/t5-one-line-summary")

# T5 uses a max_length of 512 so we cut the article to 512 tokens.
inputs = tokenizer.encode("summarize: " + ARTICLE, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
tokenizer.decode(outputs[0])
def summarize(text):
    inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


Downloading:   0%|          | 0.00/1.76k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.14G [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.36k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/850M [00:00<?, ?B/s]

### Print Results

In [70]:
for ID, cluster in get_clusters(cluster_model).items():
    sentences = ".".join(cluster)
    print(sentences, "\n\n", "sum:::", summarize(sentences), "\n\n\n")
    

The difficulty with this task lies in the fact that prosodic cues are never absolute ; they are relative to individual speakers , gender , dialect , discourse context , local context , phonological environment , and many other factors.Apart from system delay , another current limitation that will influence future interactive speech systems is the unavailability of full prosodic analysis.One obvious shortcoming is that some information gets lost in the thresholding that converts posterior probabilities from the prosodic model and the auxiliary LM into binary features.One major time and cost limitation in developing LVCSR systems in Indian languages is the need for large training data 

 sum::: ['The aim of this paper is to develop a speech recognition system (LVCSR) based on prosodic cues.'] 



The problem with rich annotations is that they increase the state space of the grammar substantially.The only shortcoming is the cost of annotation.But we also believe that ultimately this issue

KeyboardInterrupt: 