# Experiment 2: K-Means using SIF-weighted fastText embeddings

In this experiment, summaries are generated by running K-Means clustering on the embedded sentences of a document. The length of the summary is determined by the number of clusters *k*, which is set to the desired number of sentences in the summary.
Sentence embeddings are obtained as the average of the individual fastText word embeddings, each weighted by its smooth inverse frequency (SIF).
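
The idea can be illustrated with a small, dependency-free sketch. It is not the code behind `kMeans` or `FTEmbedder`, whose internals are not shown in this notebook. It assumes the common SIF weight `a / (a + p(w))`, where `p(w)` is the relative frequency of word `w` and `a` is a small constant, and it omits the common-component removal step that usually accompanies SIF. It also assumes that the summary consists of the sentence closest to each cluster centroid. The toy word vectors, word probabilities, and the names `sif_embed`, `kmeans`, and `pick_summary` are hypothetical.

```python
import random


def sif_embed(tokens, word_vecs, word_prob, a=1e-3):
    """Average the word vectors of one sentence, weighting each by a / (a + p(w))."""
    dim = len(next(iter(word_vecs.values())))
    total = [0.0] * dim
    known = [t for t in tokens if t in word_vecs]  # skip out-of-vocabulary tokens
    for tok in known:
        weight = a / (a + word_prob.get(tok, 0.0))
        total = [s + weight * x for s, x in zip(total, word_vecs[tok])]
    return [s / len(known) for s in total] if known else total


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns the k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster (keep it if the cluster is empty).
        updated = [
            [sum(col) / len(members) for col in zip(*members)] if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def pick_summary(sentences, embeddings, k):
    """Return the sentence nearest to each centroid, in document order."""
    centroids = kmeans(embeddings, k)
    chosen = {
        min(range(len(embeddings)), key=lambda i: sq_dist(embeddings[i], c))
        for c in centroids
    }
    return [sentences[i] for i in sorted(chosen)]


# Hypothetical toy data: 2-dimensional "word vectors" and unigram probabilities.
word_vecs = {"cats": [1.0, 0.0], "purr": [0.9, 0.1], "stocks": [0.0, 1.0],
             "fell": [0.1, 0.9], "the": [0.5, 0.5]}
word_prob = {"the": 0.05, "cats": 0.001, "purr": 0.0005, "stocks": 0.001, "fell": 0.002}
sentences = ["the cats purr", "cats purr", "the stocks fell", "stocks fell"]

embeddings = [sif_embed(s.split(), word_vecs, word_prob) for s in sentences]
summary = pick_summary(sentences, embeddings, k=2)
print(summary)
```

With `k=2` the sketch returns one sentence from each of the two topics. The frequent word "the" receives a weight of roughly 0.02, so it contributes very little to the sentence vectors. In practice a library implementation such as scikit-learn's `KMeans` would usually replace the hand-written loop.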

In [1]:
import pandas as pd
import tqdm
from rouge import Rouge

In [2]:
from Fasttext import FTEmbedder
from Preprocessors import StandardPreprocessor
from Evaluator import USEevaluator
from models.unsupervised import kMeans

In [3]:
test_data = pd.read_pickle("./training_data/test_raw.pkl")

In [4]:
test_data = test_data.sort_values(by=['Language'])

In [5]:
test_data.head()

| | index | Lead | Body | Language | ID | isTrain | Summary_Fasttext_Mean |
|---|---|---|---|---|---|---|---|
| 2 | 2 | Yukos' owner Menatep Group says it will ask Ro... | Yukos unit buyer faces loan claim The owners o... | English | 2 | False | "The pledged assets are with Rosneft, so it wi... |
| 34 | 34 | But last year, he was forced to resign from al... | Japanese mogul arrested for fraud One of Japan... | English | 34 | False | Inheriting a large property business from his ... |
| 40 | 40 | Japan's economy grew 2.6% overall last year - ... | Japan economy slides to recession The Japanese... | English | 40 | False | The Tokyo stock market fell after the figures ... |
| 62 | 62 | Both Boeing and Airbus have been taking orders... | Boeing unveils new 777 aircraft US aircraft fi... | English | 62 | False | "Boeing has the latest variant in a very succe... |
| 71 | 71 | The biggest slice of the 246,570 ID fraud case... | ID theft surge hits US consumers Almost a quar... | English | 71 | False | Another 18% came from attempts to rip off peop... |


In [6]:
summarizer = kMeans(FTEmbedder, StandardPreprocessor)
comparator = USEevaluator(metric="cosine")

In [7]:
summaries = []
cosims = []

In [8]:
flatdict = {}
rouge = Rouge()

In [None]:
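from tqdm.notebook import tqdm as tqdm_nb  # replaces the deprecated tqdm.tqdm_notebook

for i, row in tqdm_nb(test_data.iterrows(), total=len(test_data.index)):
    try:
        smry = summarizer.summarize(row.Body, row.Language, 0.2, sif=True)
    except Exception:
        # Fall back to a blank summary if summarization fails for this document.
        smry = " "
    if len(smry) < 5:
        # Treat near-empty summaries as blank.
        smry = " "
    summaries.append(smry)
    scores = rouge.get_scores(smry, row.Lead)[0]
    # Flatten in a fixed order (f, p, r for each metric) so the values line up
    # with the column names assigned to the ROUGE scores DataFrame.
    flatdict[i] = [scores[metric][key]
                   for metric in ("rouge-1", "rouge-2", "rouge-l")
                   for key in ("f", "p", "r")]
    cosims.append(comparator.compare(smry, row.Lead))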

Loading embeddings for English
Done.


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/swrdata/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
test_data["Summary_Fasttext_SIF"] = summaries

In [None]:
test_data.to_pickle('./training_data/test_raw.pkl')

In [None]:
r_scores = pd.DataFrame.from_dict(
    flatdict, orient="index",
    columns=['R1_f', 'R1_p', 'R1_r', 'R2_f', 'R2_p', 'R2_r', 'Rl_f', 'Rl_p', 'Rl_r'])

In [None]:
test_data = pd.merge(test_data, r_scores, left_index=True, right_index=True)

In [None]:
test_data["cosine_sim"] = cosims

In [None]:
test_data.head()

In [None]:
test_data.R2_f.describe()

In [None]:
test_data.R2_p.describe()

In [None]:
test_data.R2_r.describe()

In [None]:
test_data.cosine_sim.describe()