# Experiment 4: CRSum using fastText word embeddings

In this experiment, summaries are generated by a CRSum model. CRSum is an attention-based neural network trained to predict the cosine similarity of a sentence to a hypothetical summary. The actual summary is obtained by selecting the *n* sentences with the highest predicted similarity, where *n* is the desired number of sentences in the summary. The model is trained on pre-trained aligned fastText word embeddings (https://fasttext.cc/docs/en/aligned-vectors.html). Sentence embeddings are not computed separately; they arise implicitly in the hidden layers of the model.
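
The selection step described above can be sketched in a few lines. This is a minimal illustration, not the `CRSum` implementation: the function name `select_top_n`, the example sentences, and the scores are all hypothetical, and restoring the chosen sentences to document order is an assumption made for readability.

```python
from typing import List, Sequence


def select_top_n(sentences: Sequence[str], scores: Sequence[float], n: int) -> List[str]:
    """Return the n highest-scoring sentences, restored to document order."""
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    # Rank sentence indices by predicted similarity, highest first.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the n best, then re-sort by position (an assumption, not confirmed by the source).
    chosen = sorted(ranked[:max(n, 0)])
    return [sentences[i] for i in chosen]


# Hypothetical sentences and made-up similarity scores, for illustration only.
example_sentences = ["A opens the story.", "B adds detail.", "C gives background.", "D concludes."]
example_scores = [0.71, 0.35, 0.12, 0.64]
example_summary = select_top_n(example_sentences, example_scores, n=2)
print(example_summary)
```

In the real model the scores would come from the network's predictions rather than from a hand-written list.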

In [1]:
import pandas as pd
import tqdm
from rouge import Rouge

In [2]:
from models.supervised import CRSum

In [3]:
from Preprocessors import CRSumPreprocessor
from Evaluator import USEevaluator

In [4]:
test_data = pd.read_pickle("./training_data/test_raw.pkl")

In [5]:
test_data = test_data.sort_values(by=['Language'])

In [8]:
test_data.head()

| Unnamed: 0 | index | Lead | Body | Language | ID | isTrain |
|---|---|---|---|---|---|---|
| 2 | 2 | Yukos' owner Menatep Group says it will ask Ro... | Yukos unit buyer faces loan claim The owners o... | English | 2 | False |
| 3 | 3 | Pernod has reduced the debt it took on to fund... | Pernod takeover talk lifts Domecq Shares in UK... | English | 3 | False |
| 14 | 14 | On Tuesday, the company's administrator, turna... | Parmalat boasts doubled profits Parmalat, the ... | English | 14 | False |
| 15 | 15 | India's rupee has hit a five-year high after S... | India's rupee hits five-year high India's rupe... | English | 15 | False |
| 23 | 23 | The affected vehicles in the product recall ar... | Safety alert as GM recalls cars The world's bi... | English | 23 | False |


In [9]:
summarizer = CRSum(embedding_model=None, preprocessor=CRSumPreprocessor, M=5, N=5, verbose=False)

In [11]:
summarizer.loadWeights("best_model.h5")

In [12]:
comparator = USEevaluator(metric="cosine")

In [13]:
summaries = []
cosims = []

In [14]:
flatdict = {}
rouge = Rouge()

In [15]:
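# tqdm.tqdm_notebook is deprecated; tqdm.notebook.tqdm is the supported progress bar.
from tqdm.notebook import tqdm as notebook_tqdm

# ROUGE keys in the order expected by the column names of the score table.
ROUGE_METRICS = ("rouge-1", "rouge-2", "rouge-l")
ROUGE_STATS = ("f", "p", "r")

for i, row in notebook_tqdm(test_data.iterrows(), total=len(test_data.index)):
    try:
        smry = summarizer.summarize(row.Body, row.Language, 0.2)
    except Exception:
        # Fall back to a blank summary so that one failing document does not abort the run.
        smry = " "
    if smry == "":
        # Avoid passing an empty hypothesis to Rouge.
        smry = " "
    summaries.append(smry)
    scores = rouge.get_scores(smry, row.Lead)[0]
    # Flatten the nested scores in a fixed order instead of relying on dict ordering.
    flatdict[i] = [scores[metric][stat] for metric in ROUGE_METRICS for stat in ROUGE_STATS]
    cosims.append(comparator.compare(smry, row.Lead))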

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/swrdata/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

In [17]:
test_data["Summary_CRSum"] = summaries

In [18]:
test_data.to_pickle('./training_data/test_raw.pkl')

In [19]:
r_scores = pd.DataFrame.from_dict(
    flatdict,
    orient="index",
    columns=['R1_f', 'R1_p', 'R1_r', 'R2_f', 'R2_p', 'R2_r', 'Rl_f', 'Rl_p', 'Rl_r'],
)

In [20]:
test_data = pd.merge(test_data, r_scores, left_index=True, right_index=True)
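
The nine ROUGE columns are only correct if each name lines up with the position of the corresponding score in the flattened list. As a hedged sketch, the mapping can be made explicit by building a named row from the nested result. The key names `rouge-1`, `rouge-2`, `rouge-l` and `f`, `p`, `r` are those returned by the `rouge` package; the helper `flatten_rouge` and the numbers below are hypothetical.

```python
from typing import Dict

# Key names as returned by the `rouge` package's get_scores().
METRIC_PREFIXES = {"rouge-1": "R1", "rouge-2": "R2", "rouge-l": "Rl"}
STAT_KEYS = ("f", "p", "r")


def flatten_rouge(scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Flatten one nested ROUGE result into {'R1_f': ..., ..., 'Rl_r': ...}."""
    return {
        f"{prefix}_{stat}": scores[metric][stat]
        for metric, prefix in METRIC_PREFIXES.items()
        for stat in STAT_KEYS
    }


# Made-up values, deliberately given in a different key order than the columns.
nested_example = {
    "rouge-l": {"r": 0.3, "p": 0.6, "f": 0.4},
    "rouge-1": {"r": 0.2, "p": 0.5, "f": 0.29},
    "rouge-2": {"r": 0.1, "p": 0.4, "f": 0.16},
}
flat_row = flatten_rouge(nested_example)
print(list(flat_row))
```

A dictionary of such rows, keyed by the DataFrame index, could be passed to `pd.DataFrame.from_dict(..., orient="index")` without a separate `columns` argument.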

In [21]:
test_data["cosine_sim"] = cosims

In [22]:
test_data.head()

| Unnamed: 0 | index | Lead | Body | Language | ID | isTrain | Summary_CRSum | R1_f | R1_p | R1_r | R2_f | R2_p | R2_r | Rl_f | Rl_p | Rl_r | cosine_sim |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2 | Yukos' owner Menatep Group says it will ask Ro... | Yukos unit buyer faces loan claim The owners o... | English | 2 | False | Yukos unit buyer faces loan claim The owners o... | 0.404372 | 0.660714 | 0.291339 | 0.276243 | 0.454545 | 0.198413 | 0.442748 | 0.674419 | 0.329545 | 0.647187 |
| 3 | 3 | Pernod has reduced the debt it took on to fund... | Pernod takeover talk lifts Domecq Shares in UK... | English | 3 | False | Pernod takeover talk lifts Domecq Shares in UK... | 0.390533 | 0.634615 | 0.282051 | 0.299401 | 0.490196 | 0.215517 | 0.424242 | 0.622222 | 0.321839 | 0.599855 |
| 14 | 14 | On Tuesday, the company's administrator, turna... | Parmalat boasts doubled profits Parmalat, the ... | English | 14 | False | Less welcome was the news that the firm had be... | 0.444444 | 0.652174 | 0.337079 | 0.360902 | 0.533333 | 0.272727 | 0.464286 | 0.619048 | 0.371429 | 0.596636 |
| 15 | 15 | India's rupee has hit a five-year high after S... | India's rupee hits five-year high India's rupe... | English | 15 | False | India's rupee hits five-year high India's rupe... | 0.45098 | 0.511111 | 0.403509 | 0.34 | 0.386364 | 0.303571 | 0.481928 | 0.526316 | 0.444444 | 0.720971 |
| 23 | 23 | The affected vehicles in the product recall ar... | Safety alert as GM recalls cars The world's bi... | English | 23 | False | Safety alert as GM recalls cars The world's bi... | 0.64486 | 0.945205 | 0.489362 | 0.603774 | 0.888889 | 0.457143 | 0.717647 | 0.938462 | 0.580952 | 0.766418 |


In [23]:
test_data.R2_f.describe()

count    8430.000000
mean        0.090013
std         0.126087
min         0.000000
25%         0.018265
50%         0.048780
75%         0.105263
max         1.000000
Name: R2_f, dtype: float64

In [24]:
test_data.R2_p.describe()

count    8430.000000
mean        0.084337
std         0.164619
min         0.000000
25%         0.011331
50%         0.031982
75%         0.074627
max         1.000000
Name: R2_p, dtype: float64

In [25]:
test_data.R2_r.describe()

count    8430.000000
mean        0.151039
std         0.151677
min         0.000000
25%         0.040000
50%         0.111111
75%         0.214286
max         1.000000
Name: R2_r, dtype: float64
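
To make the three ROUGE-2 statistics easier to read, the following sketch shows how precision, recall, and F-score relate to bigram overlap. It is a simplified illustration, not the `rouge` package's code: whitespace tokenization, lowercasing, and clipped bigram counts are assumptions, so values may differ slightly from those reported here. The function names and example texts are hypothetical.

```python
from collections import Counter
from typing import Dict


def bigram_counts(text: str) -> Counter:
    """Count word bigrams after naive lowercasing and whitespace tokenization."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_sketch(summary: str, reference: str) -> Dict[str, float]:
    """Return ROUGE-2 style precision, recall and F1 for one summary/reference pair."""
    hyp, ref = bigram_counts(summary), bigram_counts(reference)
    overlap = sum((hyp & ref).values())  # clipped count of shared bigrams
    p = overlap / sum(hyp.values()) if hyp else 0.0  # share of summary bigrams found in the reference
    r = overlap / sum(ref.values()) if ref else 0.0  # share of reference bigrams covered by the summary
    f = 2 * p * r / (p + r) if p + r else 0.0        # harmonic mean of p and r
    return {"p": p, "r": r, "f": f}


# A short summary fully contained in a longer reference: high precision, lower recall.
rouge2_example = rouge2_sketch("the cat sat", "the cat sat on the mat")
print(rouge2_example)
```

A summary that is much shorter than the reference lead can reach high precision while recall stays low, and the reverse holds for a summary much longer than the lead.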

In [26]:
test_data.cosine_sim.describe()

count    8430.000000
mean        0.549407
std         0.155028
min        -0.044249
25%         0.449834
50%         0.558946
75%         0.660158
max         1.000000
Name: cosine_sim, dtype: float64
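
The `cosine_sim` column ranges from slightly below zero to 1. Assuming `comparator.compare` returns the cosine of the angle between two text embedding vectors (its implementation is not shown in this notebook), the measure itself can be sketched as follows. The function `cosine_similarity` and the example vectors are hypothetical, and returning 0.0 for a zero vector is a convention chosen here, not something taken from `USEevaluator`.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; lies in [-1, 1]."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors (an assumption)
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))           # opposite direction
```

Because the measure is bounded by -1 and 1, a small negative minimum is possible whenever embedding components can take negative values.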