# Experiment 4: CRSum using fastText word embeddings

In this experiment, summaries are generated by a CRSum model. CRSum is an attention-based neural network trained to predict the cosine similarity between a sentence and a hypothetical summary. The summary itself is obtained by selecting the *n* sentences with the highest predicted similarity, where *n* is the desired number of sentences in the summary. The model is trained using pre-trained aligned fastText word embeddings (https://fasttext.cc/docs/en/aligned-vectors.html). Sentence embeddings are generated implicitly through the hidden layers of the model.
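The selection step described above can be illustrated with a minimal sketch. It is not the CRSum implementation: the helper name `select_top_n`, the example sentences, and the scores are hypothetical, and restoring the original sentence order after ranking is an assumption made here for readability.

```python
def select_top_n(sentences, predicted_scores, n):
    """Return the n highest-scoring sentences, kept in document order."""
    if len(sentences) != len(predicted_scores):
        raise ValueError("one score per sentence is required")
    # Rank sentence positions by predicted similarity, highest first.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: predicted_scores[i], reverse=True)
    # Keep the top n positions, then restore the original order
    # (an assumption of this sketch, not a documented CRSum behavior).
    chosen = sorted(ranked[:n])
    return [sentences[i] for i in chosen]


sentences = ["A storm hit the coast.", "Flights were cancelled.",
             "Officials urged caution.", "Schools stayed open."]
scores = [0.61, 0.74, 0.35, 0.12]  # made-up values standing in for model output
summary = " ".join(select_top_n(sentences, scores, n=2))
print(summary)
```

In the notebook itself, the scores come from the trained model rather than from a hand-written list.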

In [1]:
import pandas as pd
import tqdm
from rouge import Rouge 

In [2]:
from models.supervised import CRSum

In [3]:
from Preprocessors import CRSumPreprocessor

In [4]:
test_data = pd.read_pickle("./training_data/test_raw.pkl")

In [5]:
test_data = test_data.sort_values(by=['Language'])

In [6]:
test_data.head()

| | index | Lead | Body | Language | ID | isTrain |
|---|---|---|---|---|---|---|
| 2 | 2 | Yukos' owner Menatep Group says it will ask Ro... | Yukos unit buyer faces loan claim The owners o... | English | 2 | False |
| 9523 | 9523 | Solange Knowles claimed her Afro was searched ... | Singer Solange Knowles , also known as younger... | English | 9523 | False |
| 9525 | 9525 | Abigail Mae Bresnik was born Saturday night as... | Space shuttle astronaut Randy Bresnik has welc... | English | 9525 | False |
| 9526 | 9526 | In past two days , Project Coronado resulted i... | WASHINGTON The Justice Department on Thursday ... | English | 9526 | False |
| 9530 | 9530 | NEW : The situation at most Northeast airports... | A winter storm blasting the Northeast caused s... | English | 9530 | False |


In [7]:
summarizer = CRSum(embedding_model=None, preprocessor=CRSumPreprocessor, M=5, N=5, verbose=False)

In [8]:
summarizer.loadWeights("best_model.h5")

In [9]:
summaries = []

In [10]:
flatdict = {}
rouge = Rouge()

In [11]:
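from tqdm.notebook import tqdm as notebook_tqdm  # replaces the deprecated tqdm.tqdm_notebook

for i, row in notebook_tqdm(test_data.iterrows(), total=len(test_data.index)):
    try:
        smry = summarizer.summarize(row.Body, row.Language, 0.2)
    except Exception:
        # Fall back to a blank summary if summarization fails for this row.
        smry = " "
    if smry == "":
        # Use a single space so the ROUGE scorer never receives an empty string.
        smry = " "
    summaries.append(smry)
    # Flatten the nested ROUGE scores (metric -> statistic) into one list per row.
    flatlist = []
    scores = rouge.get_scores(smry, row.Lead)[0]
    for metric in scores:
        for key in scores[metric]:
            flatlist.append(scores[metric][key])
    # Key by the DataFrame index label so the scores can be merged back later.
    flatdict[i] = flatlist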


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/swrdata/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!





In [12]:
test_data["Summary_USE"] = summaries

In [13]:
r_scores = pd.DataFrame.from_dict(flatdict, orient="index",
                       columns=['R1_f', 'R1_p', 'R1_r', 'R2_f', 'R2_p', 'R2_r','Rl_f', 'Rl_p', 'Rl_r'])
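The column names above rely on the order in which the score dictionaries are iterated (metric by metric, then `f`, `p`, `r` within each). That order may differ between versions of the `rouge` package, so a more defensive variant looks the values up by key. The sketch below assumes the key names `rouge-1`, `rouge-2`, `rouge-l` and `f`, `p`, `r`. The helper `flatten_rouge` and the example numbers are hypothetical.

```python
# Fixed output order matching the column names used for r_scores.
METRICS = ("rouge-1", "rouge-2", "rouge-l")
STATS = ("f", "p", "r")


def flatten_rouge(scores):
    """Flatten one ROUGE result dict into [R1_f, R1_p, R1_r, R2_f, ...]."""
    return [scores[metric][stat] for metric in METRICS for stat in STATS]


# Hand-written example in the assumed shape of rouge.get_scores(...)[0];
# the numbers are illustrative, not real results.
example = {
    "rouge-1": {"r": 0.3, "p": 0.2, "f": 0.24},
    "rouge-2": {"r": 0.1, "p": 0.05, "f": 0.07},
    "rouge-l": {"r": 0.25, "p": 0.15, "f": 0.19},
}
print(flatten_rouge(example))
```

With explicit lookups, the values land in the intended columns even when the inner dictionaries list `r` before `f`, as in the example.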

In [14]:
test_data = pd.merge(test_data, r_scores, left_index=True, right_index=True)

In [15]:
test_data.head()

| | index | Lead | Body | Language | ID | isTrain | Summary_USE | R1_f | R1_p | R1_r | R2_f | R2_p | R2_r | Rl_f | Rl_p | Rl_r |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2 | Yukos' owner Menatep Group says it will ask Ro... | Yukos unit buyer faces loan claim The owners o... | English | 2 | False | Yukos unit buyer faces loan claim The owners o... | 0.494845 | 0.716418 | 0.377953 | 0.364583 | 0.530303 | 0.277778 | 0.557143 | 0.75 | 0.443182 |
| 9523 | 9523 | Solange Knowles claimed her Afro was searched ... | Singer Solange Knowles , also known as younger... | English | 9523 | False | Singer Solange Knowles , also known as younger... | 0.487562 | 0.352518 | 0.790323 | 0.251256 | 0.181159 | 0.409836 | 0.458333 | 0.340206 | 0.702128 |
| 9525 | 9525 | Abigail Mae Bresnik was born Saturday night as... | Space shuttle astronaut Randy Bresnik has welc... | English | 9525 | False | ET Sunday to announce the birth of Abigail Mae... | 0.230216 | 0.143498 | 0.581818 | 0.094203 | 0.058559 | 0.240741 | 0.275862 | 0.192 | 0.489796 |
| 9526 | 9526 | In past two days , Project Coronado resulted i... | WASHINGTON The Justice Department on Thursday ... | English | 9526 | False | WASHINGTON The Justice Department on Thursday ... | 0.354286 | 0.254098 | 0.584906 | 0.150289 | 0.107438 | 0.25 | 0.325926 | 0.247191 | 0.478261 |
| 9530 | 9530 | NEW : The situation at most Northeast airports... | A winter storm blasting the Northeast caused s... | English | 9530 | False | A winter storm blasting the Northeast caused s... | 0.335766 | 0.283951 | 0.410714 | 0.088889 | 0.075 | 0.109091 | 0.240741 | 0.213115 | 0.276596 |


In [16]:
test_data.R2_f.describe()

count    8430.000000
mean        0.080897
std         0.130507
min         0.000000
25%         0.012739
50%         0.037736
75%         0.088889
max         1.000000
Name: R2_f, dtype: float64