# Experiment 2: K-Means using SIF-weighted fastText embeddings

In this experiment, summaries are generated by running K-Means clustering on the embedded sentences of a document. The length of the summary is determined by the number of clusters *k*, where *k* equals the desired number of sentences in the summary.
Sentence embeddings are computed as the average of the individual fastText word embeddings, weighted by their smooth inverse frequency (SIF).
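
The weighting step can be illustrated with a small, dependency-free sketch. It is not the code behind `FTEmbedder`, whose internals are not shown in this notebook. The function names, the smoothing constant `a=1e-3`, and the choice to estimate word probabilities from the input sentences (rather than from a large reference corpus) are all assumptions made for illustration. Each word receives the weight `a / (a + p(word))`, so frequent words contribute less to the sentence vector.

```python
from collections import Counter


def sif_weights(tokenized_sentences, a=1e-3):
    """Return {word: a / (a + p(word))}, with p estimated from the given sentences.

    Assumption: a real setup would usually take p(word) from a large corpus.
    """
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    total = sum(counts.values())
    return {word: a / (a + count / total) for word, count in counts.items()}


def sif_sentence_embedding(tokens, word_vectors, weights):
    """Weighted average of the word vectors of `tokens`; unknown words are skipped."""
    known = [t for t in tokens if t in word_vectors and t in weights]
    if not known:
        return None  # no usable word in this sentence
    dim = len(word_vectors[known[0]])
    summed = [0.0] * dim
    for tok in known:
        for j, value in enumerate(word_vectors[tok]):
            summed[j] += weights[tok] * value
    return [s / len(known) for s in summed]


# Toy 2-dimensional "word vectors" (hypothetical, not real fastText vectors).
toy_vectors = {"the": [1.0, 0.0], "cat": [0.0, 1.0]}
toy_corpus = [["the", "cat"], ["the", "the"]]

weights = sif_weights(toy_corpus)
embedding = sif_sentence_embedding(["the", "cat"], toy_vectors, weights)
print(weights)
print(embedding)
```

In the toy example the rarer word "cat" gets a larger weight than "the", so the sentence vector leans toward the "cat" direction. The original SIF method additionally removes the projection onto the first principal component of all sentence vectors; whether `FTEmbedder` does so is not visible here, and this sketch leaves that step out.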

In [1]:
import pandas as pd
import tqdm
from rouge import Rouge

In [2]:
from Fasttext import FTEmbedder
from Preprocessors import StandardPreprocessor
from Evaluator import USEevaluator
from models.unsupervised import kMeans

In [3]:
test_data = pd.read_pickle("./training_data/test_raw.pkl")

In [4]:
test_data = test_data.sort_values(by=['Language'])

In [5]:
test_data.head()

| Unnamed: 0 | index | Lead | Body | Language | ID | isTrain | Summary_CRSum |
|---|---|---|---|---|---|---|---|
| 2 | 2 | Yukos' owner Menatep Group says it will ask Ro... | Yukos unit buyer faces loan claim The owners o... | English | 2 | False | Yukos unit buyer faces loan claim The owners o... |
| 9455 | 9455 | Deputy foreign policy chiefs from Iran , EU wi... | Envoys from Iran and the European Union will m... | English | 9455 | False | Envoys from Iran and the European Union will m... |
| 9456 | 9456 | Isaac hit Louisiana as a hurricane and lingere... | For Urban Treuil , there 's no escaping the mi... | English | 9456 | False | Because of Hurricane Isaac , Treuil 's home in... |
| 9459 | 9459 | Hundreds of officers resume search hours after... | Hundreds of law enforcement officers searched ... | English | 9459 | False | Hundreds of law enforcement officers searched ... |
| 9462 | 9462 | Martha Burk : A decade of protests opened Augu... | Dividing up the newspapers on a recent weekend... | English | 9462 | False | This week 's Masters Golf Tournament marks the... |


In [6]:
summarizer = kMeans(FTEmbedder, StandardPreprocessor)
comparator = USEevaluator(metric="cosine")
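
The `kMeans` class is a local module whose implementation is not shown here. As a rough illustration of how a K-Means summarizer of this kind could select sentences, the sketch below clusters sentence embeddings and keeps, for each cluster, the sentence closest to the centroid, returned in document order. Everything in it is an assumption: the function names, the reading of the numeric argument as a ratio of sentences to keep, the nearest-to-centroid rule, and the deterministic seeding with the first *k* points (a library implementation such as scikit-learn's `KMeans` would normally be used instead).

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def k_means(points, k, iterations=20):
    """Plain Lloyd's algorithm, seeded with the first k points for determinism."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def summarize(sentences, embeddings, ratio=0.2):
    """Pick the sentence nearest to each centroid; k is a ratio of the sentence count."""
    k = max(1, round(ratio * len(sentences)))  # never ask for zero clusters
    centroids = k_means(embeddings, k)
    chosen = set()
    for centroid in centroids:
        best = min(range(len(sentences)),
                   key=lambda i: squared_distance(embeddings[i], centroid))
        chosen.add(best)
    return " ".join(sentences[i] for i in sorted(chosen))


# Hypothetical toy document: two well-separated groups of sentence vectors.
toy_sentences = ["A.", "B.", "C.", "D.", "E.", "F."]
toy_embeddings = [[0, 0], [1, 0], [0, 1], [10, 10], [10, 11], [11, 10]]
summary = summarize(toy_sentences, toy_embeddings, ratio=0.3)
print(summary)
```

With six sentences and a ratio of 0.3 the sketch requests two clusters and returns one sentence from each group. Clamping *k* to at least 1 is a design choice of this sketch so that very short documents still yield a summary.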

In [7]:
summaries = []
cosims = []

In [8]:
flatdict = {}
rouge = Rouge()

In [9]:
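from tqdm.notebook import tqdm as notebook_tqdm  # replaces the deprecated tqdm.tqdm_notebook

for i, row in notebook_tqdm(test_data.iterrows(), total=len(test_data.index)):
    try:
        smry = summarizer.summarize(row.Body, row.Language, 0.2, sif=True)
    except Exception:
        # Fall back to a blank summary if summarization fails for a document.
        smry = " "
    # Treat very short outputs (fewer than 5 characters) as blank summaries.
    if len(smry) < 5:
        smry = " "
    summaries.append(smry)
    # Flatten the nested ROUGE result (metric -> score type) into one list per row;
    # the iteration order must match the column names assigned to r_scores later.
    flatlist = []
    scores = rouge.get_scores(smry, row.Lead)[0]
    for metric in scores:
        for key in scores[metric]:
            flatlist.append(scores[metric][key])
    flatdict[i] = flatlist
    cosims.append(comparator.compare(smry, row.Lead))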



Loading embeddings for English
Done.


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/swrdata/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Loading embeddings for French
Done.




Loading embeddings for German
Done.







In [10]:
test_data["Summary_Fasttext_SIF"] = summaries

In [11]:
test_data.to_pickle('./training_data/test_raw.pkl')

In [12]:
# One row per test document; columns follow the flattened ROUGE order (metric, then f/p/r).
r_scores = pd.DataFrame.from_dict(flatdict, orient="index",
                                  columns=['R1_f', 'R1_p', 'R1_r', 'R2_f', 'R2_p', 'R2_r', 'Rl_f', 'Rl_p', 'Rl_r'])
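
The column names above only line up with the values if the nested ROUGE result is flattened in exactly the order metric (1, 2, L), then f, p, r. A minimal sketch of an order-independent way to do this is shown below. It assumes the result has the shape returned by the `rouge` package with its default metrics (`rouge-1`, `rouge-2`, `rouge-l`, each with keys `f`, `p`, `r`). The helper name `flatten_rouge` is hypothetical, and the example values are copied from the first row of the results table in this notebook.

```python
ROUGE_METRICS = ("rouge-1", "rouge-2", "rouge-l")
SCORE_KEYS = ("f", "p", "r")
COLUMNS = ["R1_f", "R1_p", "R1_r", "R2_f", "R2_p", "R2_r", "Rl_f", "Rl_p", "Rl_r"]


def flatten_rouge(scores):
    """Flatten a nested ROUGE dict by explicit key lookup, not by dict order."""
    return [scores[metric][key] for metric in ROUGE_METRICS for key in SCORE_KEYS]


# Keys deliberately listed in a different order to show that lookup order wins.
example_scores = {
    "rouge-l": {"r": 0.113636, "p": 0.37037, "f": 0.173913},
    "rouge-1": {"r": 0.110236, "p": 0.482759, "f": 0.179487},
    "rouge-2": {"r": 0.007937, "p": 0.035714, "f": 0.012987},
}

row = dict(zip(COLUMNS, flatten_rouge(example_scores)))
print(row)
```

Looking up keys explicitly keeps the column assignment correct even if a different version of the scoring library returns the inner keys in another order.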

In [13]:
test_data = pd.merge(test_data, r_scores, left_index=True, right_index=True)

In [14]:
test_data["cosine_sim"] = cosims

In [15]:
test_data.head()

| Unnamed: 0 | index | Lead | Body | Language | ID | isTrain | Summary_CRSum | Summary_Fasttext_SIF | R1_f | R1_p | R1_r | R2_f | R2_p | R2_r | Rl_f | Rl_p | Rl_r | cosine_sim |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2 | Yukos' owner Menatep Group says it will ask Ro... | Yukos unit buyer faces loan claim The owners o... | English | 2 | False | Yukos unit buyer faces loan claim The owners o... | Rosneft officials were unavailable for comment... | 0.179487 | 0.482759 | 0.110236 | 0.012987 | 0.035714 | 0.007937 | 0.173913 | 0.37037 | 0.113636 | 0.401247 |
| 9455 | 9455 | Deputy foreign policy chiefs from Iran , EU wi... | Envoys from Iran and the European Union will m... | English | 9455 | False | Envoys from Iran and the European Union will m... | Iranian Foreign Ministry spokeswoman Marziyeh ... | 0.266667 | 0.304348 | 0.237288 | 0.038835 | 0.044444 | 0.034483 | 0.282609 | 0.295455 | 0.270833 | 0.469877 |
| 9456 | 9456 | Isaac hit Louisiana as a hurricane and lingere... | For Urban Treuil , there 's no escaping the mi... | English | 9456 | False | Because of Hurricane Isaac , Treuil 's home in... | For Urban Treuil , there 's no escaping the mi... | 0.21164 | 0.153846 | 0.338983 | 0.010695 | 0.007752 | 0.017241 | 0.184211 | 0.138614 | 0.27451 | 0.463949 |
| 9459 | 9459 | Hundreds of officers resume search hours after... | Hundreds of law enforcement officers searched ... | English | 9459 | False | Hundreds of law enforcement officers searched ... | On Friday night , police had surrounded an are... | 0.2875 | 0.214953 | 0.433962 | 0.088608 | 0.066038 | 0.134615 | 0.292308 | 0.223529 | 0.422222 | 0.470378 |
| 9462 | 9462 | Martha Burk : A decade of protests opened Augu... | Dividing up the newspapers on a recent weekend... | English | 9462 | False | This week 's Masters Golf Tournament marks the... | This week 's Masters Golf Tournament marks the... | 0.258555 | 0.178947 | 0.465753 | 0.076628 | 0.05291 | 0.138889 | 0.237288 | 0.168 | 0.403846 | 0.600608 |


In [16]:
test_data.R2_f.describe()

count    8430.000000
mean        0.040303
std         0.074668
min         0.000000
25%         0.000000
50%         0.014925
75%         0.045096
max         1.000000
Name: R2_f, dtype: float64

In [17]:
test_data.R2_p.describe()

count    8430.000000
mean        0.043226
std         0.102228
min         0.000000
25%         0.000000
50%         0.010399
75%         0.038462
max         1.000000
Name: R2_p, dtype: float64

In [18]:
test_data.R2_r.describe()

count    8430.000000
mean        0.054109
std         0.087455
min         0.000000
25%         0.000000
50%         0.025641
75%         0.072727
max         1.000000
Name: R2_r, dtype: float64
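
The three ROUGE-2 summaries above describe per-document F-score, precision, and recall. Assuming the F columns are the balanced F1 score, that is, the harmonic mean of precision and recall, the relationship can be checked against the first row of the results table (`R1_p = 0.482759`, `R1_r = 0.110236`, `R1_f = 0.179487`). The helper name `f_score` is hypothetical and the scoring library's exact formula may include a small smoothing term.

```python
def f_score(precision, recall):
    """Balanced F1: harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the first row of the results table in this notebook.
r1_f = f_score(0.482759, 0.110236)
print(round(r1_f, 6))
```

Because the harmonic mean is computed per document and never exceeds the larger of the two inputs, the mean F-score over all documents can be lower than both the mean precision and the mean recall, as it is in the ROUGE-2 statistics here.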

In [19]:
test_data.cosine_sim.describe()

count    8430.000000
mean        0.418553
std         0.172641
min        -0.085581
25%         0.300473
50%         0.428596
75%         0.543111
max         1.000000
Name: cosine_sim, dtype: float64
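
The statistics above include a slightly negative minimum, which is possible because cosine similarity ranges from -1 to 1. How `USEevaluator` embeds the two texts is not shown in this notebook (the name suggests a Universal Sentence Encoder, but that is an assumption). The comparison step itself can be sketched as follows, with the function name and the zero-vector convention chosen for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

A value of 1.0 means the summary and the reference lead point in the same direction in embedding space, values near 0 indicate unrelated content, and negative values indicate embeddings that point in partly opposite directions.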