# Experiment 3: K-Means using Universal Sentence Encoder

In this experiment, summaries are generated by running K-Means clustering on the embedded sentences of a document. The length of the summary is determined by the number of clusters *k*, which equals the desired number of sentences in the summary.
Sentence embeddings are obtained from Google's Universal Sentence Encoder.
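
The idea can be illustrated with a minimal sketch. It is not the `kMeans` class used in this notebook: `toy_embed` is a hypothetical bag-of-words stand-in for the Universal Sentence Encoder, and both the rule for deriving *k* from a ratio and the choice of the sentence nearest to each centroid are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans


def toy_embed(sentences):
    """Hypothetical stand-in for a sentence encoder: bag-of-words counts."""
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    vectors = np.zeros((len(sentences), len(vocab)))
    for i, s in enumerate(sentences):
        for w in s.lower().split():
            vectors[i, index[w]] += 1.0
    return vectors


def kmeans_summary(sentences, ratio=0.2, embed=toy_embed):
    """Pick one representative sentence per cluster, in document order."""
    if not sentences:
        return []
    # Assumed rule: k is a fraction of the sentence count, but never below 1.
    k = min(len(sentences), max(1, round(ratio * len(sentences))))
    vectors = embed(sentences)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)
    chosen = set()
    for center in km.cluster_centers_:
        # Assumed selection rule: the sentence closest to each centroid.
        distances = np.linalg.norm(vectors - center, axis=1)
        chosen.add(int(np.argmin(distances)))
    return [sentences[i] for i in sorted(chosen)]


doc = [
    "The storm closed three airports on Monday.",
    "Flights were delayed for hours at all airports.",
    "Passengers slept in the terminals overnight.",
    "The airline offered refunds to stranded passengers.",
    "Snow is expected to continue until Wednesday.",
    "Road traffic was also disrupted by the snow.",
    "Schools in the region stayed closed.",
    "Power outages affected thousands of homes.",
    "Crews worked overnight to restore power.",
    "Officials urged residents to stay indoors.",
]
summary = kmeans_summary(doc, ratio=0.2)
```

Clamping *k* to at least 1 keeps very short documents from requesting zero clusters, which K-Means cannot handle.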

In [1]:
import pandas as pd
import tqdm
from rouge import Rouge 

In [2]:
from models.unsupervised import kMeans

In [3]:
from UniversalSentenceEncoder import USEEmbedder
from Preprocessors import PlaceboPreprocessor

In [4]:
test_data = pd.read_pickle("./training_data/test_raw.pkl")

In [5]:
test_data = test_data.sort_values(by=['Language'])

In [6]:
test_data.head()

| | index | Lead | Body | Language | ID | isTrain |
|---|---|---|---|---|---|---|
| 2 | 2 | Yukos' owner Menatep Group says it will ask Ro... | Yukos unit buyer faces loan claim The owners o... | English | 2 | False |
| 9523 | 9523 | Solange Knowles claimed her Afro was searched ... | Singer Solange Knowles , also known as younger... | English | 9523 | False |
| 9525 | 9525 | Abigail Mae Bresnik was born Saturday night as... | Space shuttle astronaut Randy Bresnik has welc... | English | 9525 | False |
| 9526 | 9526 | In past two days , Project Coronado resulted i... | WASHINGTON The Justice Department on Thursday ... | English | 9526 | False |
| 9530 | 9530 | NEW : The situation at most Northeast airports... | A winter storm blasting the Northeast caused s... | English | 9530 | False |


In [7]:
summarizer = kMeans(USEEmbedder, PlaceboPreprocessor)

In [8]:
summaries = []

In [9]:
flatdict = {}
rouge = Rouge()

In [10]:
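for i, row in tqdm.tqdm_notebook(test_data.iterrows(), total=len(test_data.index)):
    try:
        smry = summarizer.summarize(row.Body, row.Language, 0.2, sif=True)
    except Exception:
        # Fall back to a blank summary if summarization fails for this document.
        smry = " "
    summaries.append(smry)
    # Flatten the nested ROUGE scores into one list per document;
    # the order must match the column names assigned to r_scores later.
    flatlist = []
    scores = rouge.get_scores(smry, row.Lead)[0]
    for metric in scores:
        for key in scores[metric]:
            flatlist.append(scores[metric][key])
    flatdict[i] = flatlist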

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`

  n_local_trials = 2 + int(np.log(n_clusters))

In [11]:
test_data["Summary_USE"] = summaries

In [12]:
r_scores = pd.DataFrame.from_dict(flatdict, orient="index",
                       columns=['R1_f', 'R1_p', 'R1_r', 'R2_f', 'R2_p', 'R2_r','Rl_f', 'Rl_p', 'Rl_r'])
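
The column names above depend on the order in which the score dictionaries are iterated. A minimal sketch of a more explicit alternative follows; `ROUGE_KEYS` and `flatten_scores` are hypothetical helpers, and the nested shape with the keys `rouge-1`, `rouge-2`, `rouge-l` and `f`, `p`, `r` is assumed from the loop and column names in this notebook.

```python
# Assumed mapping from metric keys to the column prefixes used in this notebook.
ROUGE_KEYS = {"rouge-1": "R1", "rouge-2": "R2", "rouge-l": "Rl"}


def flatten_scores(scores):
    """Map nested ROUGE scores to flat column names such as 'R2_f'."""
    return {
        f"{prefix}_{part}": scores[metric][part]
        for metric, prefix in ROUGE_KEYS.items()
        for part in ("f", "p", "r")
    }


# Values copied from the first row of the results table; the inner key
# order is deliberately shuffled to show that it no longer matters.
example = {
    "rouge-1": {"r": 0.354331, "p": 1.0, "f": 0.523256},
    "rouge-2": {"p": 0.977273, "f": 0.505882, "r": 0.34127},
    "rouge-l": {"f": 0.625, "r": 0.454545, "p": 1.0},
}
flat = flatten_scores(example)
```

Storing such a dictionary per document would let `pd.DataFrame.from_dict(..., orient="index")` infer the column names without a separate `columns` argument.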

In [13]:
test_data = pd.merge(test_data, r_scores, left_index=True, right_index=True)

In [14]:
test_data.head()

| | index | Lead | Body | Language | ID | isTrain | Summary_USE | R1_f | R1_p | R1_r | R2_f | R2_p | R2_r | Rl_f | Rl_p | Rl_r |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2 | Yukos' owner Menatep Group says it will ask Ro... | Yukos unit buyer faces loan claim The owners o... | English | 2 | False | State-owned Rosneft bought the Yugansk unit fo... | 0.523256 | 1.0 | 0.354331 | 0.505882 | 0.977273 | 0.34127 | 0.625 | 1.0 | 0.454545 |
| 9523 | 9523 | Solange Knowles claimed her Afro was searched ... | Singer Solange Knowles , also known as younger... | English | 9523 | False | Can I touch it ? " It would not be someone 's ... | 0.458015 | 0.434783 | 0.483871 | 0.310078 | 0.294118 | 0.327869 | 0.411765 | 0.381818 | 0.446809 |
| 9525 | 9525 | Abigail Mae Bresnik was born Saturday night as... | Space shuttle astronaut Randy Bresnik has welc... | English | 9525 | False | Space shuttle astronaut Randy Bresnik has welc... | 0.351648 | 0.251969 | 0.581818 | 0.122222 | 0.087302 | 0.203704 | 0.363636 | 0.276596 | 0.530612 |
| 9526 | 9526 | In past two days , Project Coronado resulted i... | WASHINGTON The Justice Department on Thursday ... | English | 9526 | False | Attorney General Eric Holder announced the wra... | 0.34965 | 0.277778 | 0.471698 | 0.156028 | 0.123596 | 0.211538 | 0.278261 | 0.231884 | 0.347826 |
| 9530 | 9530 | NEW : The situation at most Northeast airports... | A winter storm blasting the Northeast caused s... | English | 9530 | False | A winter storm blasting the Northeast caused s... | 0.304 | 0.275362 | 0.339286 | 0.065041 | 0.058824 | 0.072727 | 0.196078 | 0.181818 | 0.212766 |
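
The `_f`, `_p` and `_r` suffixes stand for F-score, precision and recall. As a sanity check, the sketch below recomputes the F-score from the precision and recall in the first row of the table above. It assumes a balanced F1 (harmonic mean); `f_score` is a hypothetical helper, and the scoring library may apply small smoothing terms, so agreement is expected only up to rounding.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the first row of the table.
r1_f = f_score(1.0, 0.354331)       # compare with R1_f = 0.523256
r2_f = f_score(0.977273, 0.34127)   # compare with R2_f = 0.505882
```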


In [15]:
test_data.R2_f.describe()

count    8430.000000
mean        0.064693
std         0.100974
min         0.000000
25%         0.000000
50%         0.031056
75%         0.073171
max         1.000000
Name: R2_f, dtype: float64