# Experiment 2: K-Means using SIF-weighted fastText embeddings

In this experiment, summaries are generated by running K-Means clustering on the embedded sentences of a document. The length of the summary is determined by the number of clusters *k*, where *k* equals the desired number of sentences in the summary.
Sentence embeddings are obtained as the average of the individual fastText word embeddings, weighted by their smooth inverse frequencies (SIF).
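
To make the weighting concrete, here is a minimal sketch of a SIF-weighted average. It is not the code inside `FTEmbedder`, which is not shown in this notebook. The function names, the toy two-dimensional word vectors, and the smoothing constant `a = 1e-3` are assumptions chosen for illustration, and word frequencies are estimated from the toy sentences themselves rather than from a large corpus. The sketch also leaves out the common-component removal step that the original SIF method applies after averaging.

```python
from collections import Counter

def sif_weights(tokenized_sentences, a=1e-3):
    """Map each word to a / (a + p(w)), where p(w) is its relative frequency."""
    counts = Counter(w for sent in tokenized_sentences for w in sent)
    total = sum(counts.values())
    return {w: a / (a + c / total) for w, c in counts.items()}

def sif_sentence_embedding(tokens, word_vectors, weights):
    """Average the word vectors of one sentence, scaling each by its SIF weight."""
    dim = len(next(iter(word_vectors.values())))
    acc = [0.0] * dim
    n = 0
    for w in tokens:
        if w not in word_vectors:
            continue  # skip out-of-vocabulary words
        # A word with no frequency estimate has p(w) = 0, so its weight is 1.0
        weight = weights.get(w, 1.0)
        for j, x in enumerate(word_vectors[w]):
            acc[j] += weight * x
        n += 1
    return [x / n for x in acc] if n else acc

# Toy data (hypothetical): three tokenized sentences and 2-d "word vectors"
sentences = [["the", "cat", "sat"], ["the", "dog"], ["the", "cat"]]
word_vectors = {
    "the": [1.0, 0.0],
    "cat": [0.0, 1.0],
    "sat": [1.0, 1.0],
    "dog": [0.5, 0.5],
}

weights = sif_weights(sentences)
embeddings = [sif_sentence_embedding(s, word_vectors, weights) for s in sentences]
```

Frequent words such as "the" receive a smaller weight than rare words such as "sat", so they contribute less to the sentence vector.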

In [1]:
import pandas as pd
import tqdm
from rouge import Rouge

In [2]:
from Fasttext import FTEmbedder
from Preprocessors import StandardPreprocessor
from models.unsupervised import kMeans

In [3]:
test_data = pd.read_pickle("./training_data/test_raw.pkl")

In [4]:
test_data = test_data.sort_values(by=['Language'])

In [5]:
test_data.head()

| | index | Lead | Body | Language | ID | isTrain |
|---|---|---|---|---|---|---|
| 2 | 2 | Yukos' owner Menatep Group says it will ask Ro... | Yukos unit buyer faces loan claim The owners o... | English | 2 | False |
| 9523 | 9523 | Solange Knowles claimed her Afro was searched ... | Singer Solange Knowles , also known as younger... | English | 9523 | False |
| 9525 | 9525 | Abigail Mae Bresnik was born Saturday night as... | Space shuttle astronaut Randy Bresnik has welc... | English | 9525 | False |
| 9526 | 9526 | In past two days , Project Coronado resulted i... | WASHINGTON The Justice Department on Thursday ... | English | 9526 | False |
| 9530 | 9530 | NEW : The situation at most Northeast airports... | A winter storm blasting the Northeast caused s... | English | 9530 | False |


In [6]:
summarizer = kMeans(FTEmbedder, StandardPreprocessor)
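
The internals of the `kMeans` summarizer are not shown in this notebook. One common way to turn clusters into an extractive summary is to keep, for each cluster, the sentence whose embedding lies closest to the centroid, and to output the chosen sentences in document order. The sketch below illustrates that idea with scikit-learn; the function name `select_sentences`, the toy embeddings, and the nearest-to-centroid rule are assumptions and may differ from what `models.unsupervised.kMeans` actually does.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_sentences(embeddings, k, seed=0):
    """Return indices of the sentences nearest to each of the k centroids."""
    X = np.asarray(embeddings, dtype=float)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    chosen = set()
    for center in km.cluster_centers_:
        # Euclidean distance from every sentence embedding to this centroid
        dists = np.linalg.norm(X - center, axis=1)
        chosen.add(int(np.argmin(dists)))
    # Sort so the summary keeps the original sentence order
    return sorted(chosen)

# Toy embeddings (hypothetical): two well-separated groups of three sentences
toy_embeddings = [
    [0.0, 0.0], [0.4, 0.0], [1.0, 0.0],
    [10.0, 10.0], [10.4, 10.0], [11.0, 10.0],
]
summary_indices = select_sentences(toy_embeddings, k=2)
```

With *k* = 2, one sentence is picked from each group, so the summary has two sentences.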

In [7]:
summaries = []

In [8]:
flatdict = {}
rouge = Rouge()

In [9]:
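from tqdm.notebook import tqdm as notebook_tqdm  # replaces the deprecated tqdm.tqdm_notebook

for i, row in notebook_tqdm(test_data.iterrows(), total=len(test_data.index)):
    try:
        smry = summarizer.summarize(row.Body, row.Language, 0.2, sif=True)
    except Exception:
        # Fall back to a blank summary if summarization fails for this document
        smry = " "
    if len(smry) < 5:
        # Treat very short outputs as empty summaries
        smry = " "
    summaries.append(smry)
    scores = rouge.get_scores(smry, row.Lead)[0]
    # Flatten in a fixed order so the values line up with the R1/R2/Rl column names
    flatdict[i] = [scores[metric][key]
                   for metric in ("rouge-1", "rouge-2", "rouge-l")
                   for key in ("f", "p", "r")]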


Loading embeddings for English
Done.


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/swrdata/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Loading embeddings for French
Done.


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/swrdata/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
  n_local_trials = 2 + int(np.log(n_clusters))
  self.explained_variance_ratio_ = exp_var / full_var


Loading embeddings for German
Done.


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/swrdata/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
  n_local_trials = 2 + int(np.log(n_clusters))





In [10]:
test_data["Summary_Fasttext"] = summaries

In [11]:
r_scores = pd.DataFrame.from_dict(flatdict, orient="index",
                       columns=['R1_f', 'R1_p', 'R1_r', 'R2_f', 'R2_p', 'R2_r','Rl_f', 'Rl_p', 'Rl_r'])
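
As a small illustration of how the nested ROUGE result maps onto these nine columns, the sketch below flattens one score dictionary by hand. The numbers are copied from the first row of the `test_data.head()` output shown under `In [13]`. The key names (`rouge-1`, `rouge-2`, `rouge-l` with `f`, `p`, `r`) are assumed to match what the `rouge` package returns, and the `prefixes` mapping is introduced here purely for illustration.

```python
# One ROUGE result in the nested shape assumed for rouge.get_scores(...)[0]
scores = {
    "rouge-1": {"f": 0.177215, "p": 0.451613, "r": 0.110236},
    "rouge-2": {"f": 0.0, "p": 0.0, "r": 0.0},
    "rouge-l": {"f": 0.189655, "p": 0.392857, "r": 0.125},
}

# Hypothetical helper mapping: metric name -> column prefix
prefixes = {"rouge-1": "R1", "rouge-2": "R2", "rouge-l": "Rl"}

# Iterate in an explicit order so the result does not depend on dict ordering
flat_row = {
    f"{prefixes[metric]}_{key}": scores[metric][key]
    for metric in ("rouge-1", "rouge-2", "rouge-l")
    for key in ("f", "p", "r")
}
```

Spelling out the metric and key order keeps the values aligned with the column names even if the library changes the order of its dictionary keys.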

In [12]:
test_data = pd.merge(test_data, r_scores, left_index=True, right_index=True)

In [13]:
test_data.head()

| | index | Lead | Body | Language | ID | isTrain | Summary_Fasttext | R1_f | R1_p | R1_r | R2_f | R2_p | R2_r | Rl_f | Rl_p | Rl_r |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2 | Yukos' owner Menatep Group says it will ask Ro... | Yukos unit buyer faces loan claim The owners o... | English | 2 | False | Rosneft officials were unavailable for comment... | 0.177215 | 0.451613 | 0.110236 | 0.0 | 0.0 | 0.0 | 0.189655 | 0.392857 | 0.125 |
| 9523 | 9523 | Solange Knowles claimed her Afro was searched ... | Singer Solange Knowles , also known as younger... | English | 9523 | False | @Remzophilos : the good lord , Jesus of nazare... | 0.444444 | 0.472727 | 0.419355 | 0.295652 | 0.314815 | 0.278689 | 0.387097 | 0.391304 | 0.382979 |
| 9525 | 9525 | Abigail Mae Bresnik was born Saturday night as... | Space shuttle astronaut Randy Bresnik has welc... | English | 9525 | False | Bresnik called Mission Control at 6:14 a.m. Th... | 0.325581 | 0.239316 | 0.509091 | 0.129412 | 0.094828 | 0.203704 | 0.347826 | 0.269663 | 0.489796 |
| 9526 | 9526 | In past two days , Project Coronado resulted i... | WASHINGTON The Justice Department on Thursday ... | English | 9526 | False | Attorney General Eric Holder announced the wra... | 0.327586 | 0.301587 | 0.358491 | 0.157895 | 0.145161 | 0.173077 | 0.25 | 0.24 | 0.26087 |
| 9530 | 9530 | NEW : The situation at most Northeast airports... | A winter storm blasting the Northeast caused s... | English | 9530 | False | ET on its website . The travel situation , how... | 0.25 | 0.34375 | 0.196429 | 0.023256 | 0.032258 | 0.018182 | 0.16 | 0.214286 | 0.12766 |
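
As a quick sanity check on the column order, the F-scores in the first row can be recomputed from the precision and recall next to them. The sketch assumes the reported F-score is the usual harmonic mean of precision and recall, which the displayed values are consistent with. The helper `f1` is defined here for illustration and is not part of the notebook's code.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Precision and recall copied from the first row of the table above
r1_f = f1(0.451613, 0.110236)   # table reports R1_f = 0.177215
r2_f = f1(0.0, 0.0)             # table reports R2_f = 0.0
rl_f = f1(0.392857, 0.125)      # table reports Rl_f = 0.189655

# The inputs are rounded to six decimals, so compare with a small tolerance
matches = [
    abs(r1_f - 0.177215) < 1e-5,
    r2_f == 0.0,
    abs(rl_f - 0.189655) < 1e-5,
]
```

If the columns were mislabelled, for example with precision and F-score swapped, this check would fail for rows where precision and recall differ.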


In [14]:
test_data.R2_f.describe()

count    8430.000000
mean        0.039346
std         0.073396
min         0.000000
25%         0.000000
50%         0.014815
75%         0.045274
max         0.925926
Name: R2_f, dtype: float64