# Experiment 1: K-Means using averaged fastText embeddings

In this experiment, summaries are generated by running K-Means clustering on the embedded sentences of a document. The length of the summary is determined by the number of clusters *k*, which equals the desired number of sentences in the summary.
Sentence embeddings are obtained by averaging the individual fastText embeddings of the words in the sentence.
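
The idea can be sketched in a few lines. This is a minimal illustration, not the `kMeans` class imported from `models.unsupervised` below: the function names `embed_sentence` and `kmeans_summary` are hypothetical, the word vectors are passed in as a plain dictionary instead of a fastText model, and choosing the sentence nearest to each centroid is an assumption about how clusters are turned into summary sentences.

```python
import numpy as np
from sklearn.cluster import KMeans


def embed_sentence(tokens, word_vectors, dim):
    """Average the vectors of all known tokens; return zeros if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


def kmeans_summary(sentences, word_vectors, dim, k, seed=0):
    """Cluster sentence embeddings into k groups and keep one sentence per group.

    Assumption: the representative of a cluster is the sentence whose
    embedding lies closest to the cluster centroid.
    """
    k = max(1, min(k, len(sentences)))  # never ask for more clusters than sentences
    X = np.vstack(
        [embed_sentence(s.lower().split(), word_vectors, dim) for s in sentences]
    )
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    chosen = set()
    for centroid in km.cluster_centers_:
        dists = np.linalg.norm(X - centroid, axis=1)
        chosen.add(int(np.argmin(dists)))
    # Return the selected sentences in their original document order
    return [sentences[i] for i in sorted(chosen)]
```

Whitespace tokenization and lowercasing stand in for the real preprocessing here; the actual pipeline delegates that step to `StandardPreprocessor`.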

In [1]:
import pandas as pd
import tqdm
from rouge import Rouge

In [2]:
from Fasttext import FTEmbedder
from Preprocessors import StandardPreprocessor
from Evaluator import USEevaluator
from models.unsupervised import kMeans

In [3]:
test_data = pd.read_pickle("./training_data/test_raw.pkl")

In [4]:
test_data = test_data.sort_values(by=['Language'])

In [5]:
test_data.head()

| | index | Lead | Body | Language | ID | isTrain |
|---|---|---|---|---|---|---|
| 2 | 2 | Yukos' owner Menatep Group says it will ask Ro... | Yukos unit buyer faces loan claim The owners o... | English | 2 | False |
| 9864 | 9864 | Governments of nearby nations face similar cri... | The ouster of Tunisia 's longtime ruler has ca... | English | 9864 | False |
| 9871 | 9871 | NEW : An opposition leader says crackdown , no... | A Bahrain court sentenced eight Shiite opposit... | English | 9871 | False |
| 9880 | 9880 | NEW : Death toll could reach 50,000 , accordin... | SICHUAN , China Li Yunxia wipes away tears as ... | English | 9880 | False |
| 9881 | 9881 | Isobel Coleman : Obama mainly addressed domest... | President Obama 's State of the Union address ... | English | 9881 | False |


In [6]:
summarizer = kMeans(FTEmbedder, StandardPreprocessor)

In [7]:
comparator = USEevaluator(metric="cosine")
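
For reference, cosine similarity between two embedding vectors can be computed as below. This is only a sketch of the metric itself: it assumes that `USEevaluator` embeds the summary and the lead and compares the two vectors, which the notebook does not show, and the helper name `cosine_similarity` is hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b, in the range [-1, 1]."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing
        return 0.0
    return float(np.dot(a, b) / denom)
```

Because the value depends only on direction, not vector length, it can be negative when the two embeddings point in roughly opposite directions.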

In [8]:
summaries = []
cosims = []

In [9]:
flatdict = {}
rouge = Rouge()

In [10]:
from tqdm.notebook import tqdm as notebook_tqdm  # replaces the deprecated tqdm.tqdm_notebook

for i, row in notebook_tqdm(test_data.iterrows(), total=len(test_data.index)):
    try:
        smry = summarizer.summarize(row.Body, row.Language, 0.2, sif=False)
    except Exception:
        # Fall back to a blank summary so that one failing document does not abort the run
        smry = " "
    summaries.append(smry)
    # Flatten the nested ROUGE result (metric -> score) into one list of values per document
    scores = rouge.get_scores(smry, row.Lead)[0]
    flatlist = [scores[metric][key] for metric in scores for key in scores[metric]]
    flatdict[i] = flatlist
    cosims.append(comparator.compare(smry, row.Lead))



Loading embeddings for English
Done.


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/swrdata/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Loading embeddings for French
Done.


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/swrdata/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
  n_local_trials = 2 + int(np.log(n_clusters))


Loading embeddings for German
Done.


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/swrdata/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
  n_local_trials = 2 + int(np.log(n_clusters))





In [11]:
test_data["Summary_Fasttext_Mean"] = summaries

In [12]:
test_data.to_pickle('./training_data/test_raw.pkl')

In [13]:
r_scores = pd.DataFrame.from_dict(flatdict, orient="index",
                       columns=['R1_f', 'R1_p', 'R1_r', 'R2_f', 'R2_p', 'R2_r','Rl_f', 'Rl_p', 'Rl_r'])
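
The column names above only line up with the values if the nested score dictionary is iterated in exactly the order `f`, `p`, `r` for each of `rouge-1`, `rouge-2`, `rouge-l`. A sketch that makes this order explicit, so it no longer depends on dictionary ordering, could look as follows; the helper name `flatten_rouge` is hypothetical, and the input is assumed to be one element of the list returned by `Rouge.get_scores`.

```python
METRICS = ("rouge-1", "rouge-2", "rouge-l")
PARTS = ("f", "p", "r")


def flatten_rouge(scores):
    """Turn {'rouge-1': {'f': .., 'p': .., 'r': ..}, ...} into {'R1_f': .., ...}."""
    return {
        f"{metric.replace('rouge-', 'R')}_{part}": scores[metric][part]
        for metric in METRICS
        for part in PARTS
    }
```

A dictionary of such rows, keyed by document index, can be passed to `pd.DataFrame.from_dict(..., orient="index")` without a separate `columns` argument.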

In [14]:
test_data = pd.merge(test_data, r_scores, left_index=True, right_index=True)

In [16]:
test_data["cosine_sim"] = cosims

In [17]:
test_data.head()

| | index | Lead | Body | Language | ID | isTrain | Summary_Fasttext_Mean | R1_f | R1_p | R1_r | R2_f | R2_p | R2_r | Rl_f | Rl_p | Rl_r | cosine_sim |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2 | Yukos' owner Menatep Group says it will ask Ro... | Yukos unit buyer faces loan claim The owners o... | English | 2 | False | "The pledged assets are with Rosneft, so it wi... | 0.449704 | 0.904762 | 0.299213 | 0.419162 | 0.853659 | 0.277778 | 0.539683 | 0.894737 | 0.386364 | 0.594428 |
| 9864 | 9864 | Governments of nearby nations face similar cri... | The ouster of Tunisia 's longtime ruler has ca... | English | 9864 | False | State-run media reported at least three people... | 0.200837 | 0.128342 | 0.461538 | 0.033755 | 0.021505 | 0.078431 | 0.190476 | 0.126984 | 0.380952 | 0.384861 |
| 9871 | 9871 | NEW : An opposition leader says crackdown , no... | A Bahrain court sentenced eight Shiite opposit... | English | 9871 | False | She was arrested . Rights groups have urged Ba... | 0.314286 | 0.261905 | 0.392857 | 0.072464 | 0.060241 | 0.090909 | 0.259259 | 0.222222 | 0.311111 | 0.515974 |
| 9880 | 9880 | NEW : Death toll could reach 50,000 , accordin... | SICHUAN , China Li Yunxia wipes away tears as ... | English | 9880 | False | Watch parents ' anguished vigil " The death to... | 0.274725 | 0.195312 | 0.462963 | 0.055556 | 0.03937 | 0.09434 | 0.281879 | 0.212121 | 0.42 | 0.589803 |
| 9881 | 9881 | Isobel Coleman : Obama mainly addressed domest... | President Obama 's State of the Union address ... | English | 9881 | False | On North Korea , boilerplate promises to isola... | 0.26506 | 0.207547 | 0.366667 | 0.073171 | 0.057143 | 0.101695 | 0.267717 | 0.209877 | 0.369565 | 0.54549 |
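
The `_f`, `_p`, and `_r` columns are not independent. In the rows above, each F-score is consistent with the harmonic mean of precision and recall (the F1 score). The small helper below, whose name `f1_score` is chosen here for illustration, reproduces the first row's `R1_f` from its `R1_p` and `R1_r`.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# First row of the table: R1_p = 0.904762, R1_r = 0.299213, R1_f = 0.449704
r1_f = f1_score(0.904762, 0.299213)
```

This explains why a summary with high precision but low recall, as in the first row, still ends up with a moderate F-score.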


In [18]:
test_data.R2_f.describe()

count    8430.000000
mean        0.060019
std         0.097063
min         0.000000
25%         0.000000
50%         0.027397
75%         0.069767
max         0.958333
Name: R2_f, dtype: float64

In [19]:
test_data.R2_p.describe()

count    8430.000000
mean        0.059953
std         0.131974
min         0.000000
25%         0.000000
50%         0.019231
75%         0.051842
max         1.000000
Name: R2_p, dtype: float64

In [20]:
test_data.R2_r.describe()

count    8430.000000
mean        0.086257
std         0.110381
min         0.000000
25%         0.000000
50%         0.050000
75%         0.122807
max         1.000000
Name: R2_r, dtype: float64

In [21]:
test_data.cosine_sim.describe()

count    8430.000000
mean        0.484585
std         0.153306
min        -0.089866
25%         0.384060
50%         0.492526
75%         0.593473
max         0.998148
Name: cosine_sim, dtype: float64