# Experiment 4: CRSum using fastText word embeddings

In this experiment, summaries are generated by a CRSum model. CRSum is an attention-based neural network trained to predict the cosine similarity between a sentence and a hypothetical summary. The actual summary is obtained by selecting the *n* sentences with the highest predicted similarity, where *n* is the desired number of sentences in the summary. The model is trained on pre-trained aligned fastText word embeddings (https://fasttext.cc/docs/en/aligned-vectors.html). Sentence embeddings are generated implicitly by the hidden layers of the model.
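
The selection step described above can be sketched as follows. This is a minimal illustration, not the CRSum implementation: the function name `select_top_n`, the hand-written scores, and the choice to return the selected sentences in document order are all assumptions made for the example.

```python
from typing import List, Sequence


def select_top_n(sentences: Sequence[str], scores: Sequence[float], n: int) -> List[str]:
    """Return the n highest-scoring sentences, kept in their original order."""
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    # Rank sentence positions by predicted similarity, highest first.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the top n positions and restore document order (an assumption).
    chosen = sorted(ranked[:n])
    return [sentences[i] for i in chosen]


sentences = ["A opens.", "B adds detail.", "C is the key point.", "D closes."]
predicted = [0.41, 0.12, 0.87, 0.30]  # hypothetical model outputs
summary = " ".join(select_top_n(sentences, predicted, n=2))
print(summary)
```

Restoring document order keeps the extract readable; ordering by score instead would only require dropping the second `sorted` call.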

In [5]:
import pandas as pd
import tqdm
from rouge import Rouge 

In [6]:
from models.supervised import CRSum

In [7]:
from Preprocessors import CRSumPreprocessor
from Evaluator import USEevaluator

In [8]:
test_data = pd.read_pickle("./training_data/test_raw.pkl")

In [9]:
test_data = test_data.sort_values(by=['Language'])

In [10]:
test_data.head()

Unnamed: 0,index,Lead,Body,Language,ID,isTrain,Summary_Fasttext_Mean,Summary_Fasttext_SIF,Summary_USE
2,2,Yukos' owner Menatep Group says it will ask Ro...,Yukos unit buyer faces loan claim The owners o...,English,2,False,"""The pledged assets are with Rosneft, so it wi...",Rosneft officials were unavailable for comment...,State-owned Rosneft bought the Yugansk unit fo...
40,40,Japan's economy grew 2.6% overall last year - ...,Japan economy slides to recession The Japanese...,English,40,False,The Tokyo stock market fell after the figures ...,Gross domestic product fell by 0.1% in the las...,Gross domestic product fell by 0.1% in the las...
62,62,Both Boeing and Airbus have been taking orders...,Boeing unveils new 777 aircraft US aircraft fi...,English,62,False,"""Boeing has the latest variant in a very succe...",Better fuel efficiency from engines made by GE...,Boeing unveils new 777 aircraft US aircraft fi...
71,71,"The biggest slice of the 246,570 ID fraud case...",ID theft surge hits US consumers Almost a quar...,English,71,False,Another 18% came from attempts to rip off peop...,The report marks the fifth year in a row in wh...,Another 18% came from attempts to rip off peop...
4,4,"On an annual basis, the data suggests annual g...",Japan narrowly escapes recession Japan's econo...,English,4,False,"On an annual basis, the data suggests annual g...",The government was keen to play down the worry...,Revised figures indicated growth of just 0.1% ...


In [11]:
summarizer = CRSum(embedding_model=None, preprocessor=CRSumPreprocessor, M=5, N=5, verbose=False)

In [12]:
summarizer.loadWeights("best_model.h5")

In [13]:
comparator = USEevaluator(metric="cosine")

In [14]:
summaries = []
cosims = []

In [15]:
flatdict = {}
rouge = Rouge()

In [16]:
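for i, row in tqdm.tqdm_notebook(test_data.iterrows(), total=len(test_data.index)):
    # Fall back to a single space when summarization fails or returns nothing,
    # so that every row still gets a summary and a score.
    try:
        smry = summarizer.summarize(row.Body, row.Language, 0.2)
    except Exception:
        smry = " "
    if smry == "":
        smry = " "
    summaries.append(smry)
    # Flatten the nested ROUGE result (metric -> f/p/r) into one list per row.
    scores = rouge.get_scores(smry, row.Lead)[0]
    flatlist = []
    for metric in scores:
        for key in scores[metric]:
            flatlist.append(scores[metric][key])
    flatdict[i] = flatlist
    # Cosine similarity between the generated summary and the reference lead.
    cosims.append(comparator.compare(smry, row.Lead))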

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/swrdata/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

In [17]:
test_data["Summary_CRSum"] = summaries

In [18]:
test_data.to_pickle('./training_data/test_raw.pkl')

In [19]:
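# Column order follows the order in which the loop flattened the scores:
# ROUGE-1, ROUGE-2, ROUGE-L, each as F1, precision, recall.
r_scores = pd.DataFrame.from_dict(
    flatdict, orient="index",
    columns=['R1_f', 'R1_p', 'R1_r', 'R2_f', 'R2_p', 'R2_r', 'Rl_f', 'Rl_p', 'Rl_r'])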

In [20]:
test_data = pd.merge(test_data, r_scores, left_index=True, right_index=True)
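
The score table is keyed by the original index labels and then joined back onto the test data by index. The sketch below shows that mechanism on toy data. The score values, the `prefix` mapping and the two-row `docs` frame are hypothetical. Naming columns from the dict keys is an alternative to relying on a fixed key order, not what the notebook does.

```python
import pandas as pd

# Hypothetical scores shaped like the nested dict the scoring loop iterates over.
scores = {
    "rouge-1": {"f": 0.50, "p": 0.60, "r": 0.40},
    "rouge-2": {"f": 0.30, "p": 0.35, "r": 0.25},
    "rouge-l": {"f": 0.45, "p": 0.55, "r": 0.38},
}

# Build each column name from the keys instead of relying on dict order.
prefix = {"rouge-1": "R1", "rouge-2": "R2", "rouge-l": "Rl"}
flat = {f"{prefix[m]}_{k}": v for m, vals in scores.items() for k, v in vals.items()}

# Rows are keyed by the original index label (7 here), as in flatdict.
r_scores = pd.DataFrame.from_dict({7: flat}, orient="index")
docs = pd.DataFrame({"Lead": ["reference", "other"]}, index=[7, 3])

# An index-on-index merge lines rows up by label, not by position.
merged = pd.merge(docs, r_scores, left_index=True, right_index=True)
print(merged)
```

Because `pd.merge` defaults to an inner join, rows without a score entry (label 3 in this toy frame) are dropped from the result.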

In [21]:
test_data["cosine_sim"] = cosims

In [22]:
test_data.head()

Unnamed: 0,index,Lead,Body,Language,ID,isTrain,Summary_Fasttext_Mean,Summary_Fasttext_SIF,Summary_USE,Summary_CRSum,R1_f,R1_p,R1_r,R2_f,R2_p,R2_r,Rl_f,Rl_p,Rl_r,cosine_sim
2,2,Yukos' owner Menatep Group says it will ask Ro...,Yukos unit buyer faces loan claim The owners o...,English,2,False,"""The pledged assets are with Rosneft, so it wi...",Rosneft officials were unavailable for comment...,State-owned Rosneft bought the Yugansk unit fo...,State-owned Rosneft bought the Yugansk unit fo...,0.54023,1.0,0.370079,0.534884,1.0,0.365079,0.580645,1.0,0.409091,0.619813
40,40,Japan's economy grew 2.6% overall last year - ...,Japan economy slides to recession The Japanese...,English,40,False,The Tokyo stock market fell after the figures ...,Gross domestic product fell by 0.1% in the las...,Gross domestic product fell by 0.1% in the las...,Japan economy slides to recession The Japanese...,0.56338,0.792079,0.437158,0.496454,0.7,0.384615,0.616162,0.813333,0.495935,0.910077
62,62,Both Boeing and Airbus have been taking orders...,Boeing unveils new 777 aircraft US aircraft fi...,English,62,False,"""Boeing has the latest variant in a very succe...",Better fuel efficiency from engines made by GE...,Boeing unveils new 777 aircraft US aircraft fi...,Boeing unveils new 777 aircraft US aircraft fi...,0.563003,0.789474,0.4375,0.479784,0.674242,0.372385,0.604839,0.78125,0.493421,0.811738
71,71,"The biggest slice of the 246,570 ID fraud case...",ID theft surge hits US consumers Almost a quar...,English,71,False,Another 18% came from attempts to rip off peop...,The report marks the fifth year in a row in wh...,Another 18% came from attempts to rip off peop...,ID theft surge hits US consumers Almost a quar...,0.242857,0.459459,0.165049,0.144928,0.277778,0.098039,0.237624,0.4,0.169014,0.285142
4,4,"On an annual basis, the data suggests annual g...",Japan narrowly escapes recession Japan's econo...,English,4,False,"On an annual basis, the data suggests annual g...",The government was keen to play down the worry...,Revised figures indicated growth of just 0.1% ...,Japan narrowly escapes recession Japan's econo...,0.722689,0.934783,0.589041,0.683761,0.888889,0.555556,0.787234,0.925,0.685185,0.86177


In [23]:
test_data.R2_f.describe()

count    8430.000000
mean        0.081394
std         0.126490
min         0.000000
25%         0.013333
50%         0.039604
75%         0.091324
max         1.000000
Name: R2_f, dtype: float64

In [24]:
test_data.R2_p.describe()

count    8430.000000
mean        0.074351
std         0.157535
min         0.000000
25%         0.008000
50%         0.026172
75%         0.063158
max         1.000000
Name: R2_p, dtype: float64

In [25]:
test_data.R2_r.describe()

count    8430.000000
mean        0.143076
std         0.153266
min         0.000000
25%         0.034483
50%         0.098039
75%         0.208333
max         1.000000
Name: R2_r, dtype: float64
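
To make the three ROUGE-2 statistics above easier to read, here is a simplified sketch of how bigram precision, recall and F1 relate to each other. It is not the `rouge` package's implementation: whitespace tokenization, lowercasing and clipped bigram counts are assumptions chosen for brevity, and the package may differ in those details. The hypothesis plays the role of the generated summary and the reference the role of the lead.

```python
from collections import Counter
from typing import Dict, List


def bigrams(tokens: List[str]) -> Counter:
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))


def rouge_2(hypothesis: str, reference: str) -> Dict[str, float]:
    """Simplified ROUGE-2 on whitespace tokens with clipped bigram counts."""
    hyp = bigrams(hypothesis.lower().split())
    ref = bigrams(reference.lower().split())
    overlap = sum((hyp & ref).values())  # bigrams shared by both texts
    p = overlap / sum(hyp.values()) if hyp else 0.0  # share of summary bigrams found in the lead
    r = overlap / sum(ref.values()) if ref else 0.0  # share of lead bigrams found in the summary
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return {"f": f, "p": p, "r": r}


result = rouge_2("the economy fell in the last quarter", "the economy fell last quarter")
print(result)
```

In this toy pair, 3 of the 6 hypothesis bigrams and 3 of the 4 reference bigrams match, giving a precision of 0.5, a recall of 0.75 and an F1 of 0.6. A longer hypothesis tends to lower precision while leaving recall unchanged or higher.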

In [26]:
test_data.cosine_sim.describe()

count    8430.000000
mean        0.512408
std         0.160481
min        -0.089866
25%         0.405888
50%         0.520212
75%         0.623798
max         1.000000
Name: cosine_sim, dtype: float64
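
The slightly negative minimum above is possible because cosine similarity ranges from -1 to 1. The sketch below shows the computation on toy vectors. It assumes that `USEevaluator` compares embedding vectors of the two texts, as its `metric="cosine"` argument suggests. The helper name and the three-dimensional vectors are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    denom = norm_a * norm_b
    return dot / denom if denom > 0 else 0.0


# Hypothetical 3-dimensional "embeddings"; real sentence encoders use far more dimensions.
same = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])        # same direction
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # unrelated
opposed = cosine_similarity([1.0, 0.0, 0.0], [-1.0, 0.5, 0.0])    # pointing away
print(same, orthogonal, opposed)
```

Vectors pointing in the same direction score close to 1 regardless of their length, orthogonal vectors score 0, and vectors pointing away from each other score below 0.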