# Creating Word Embeddings for Centroid-based Text Summarization  

In this notebook, we will train the word embeddings as proposed in the paper *Centroid-based Text Summarization through Compositionality of Word Embeddings* 


## 1. Imports  

In [73]:
import pickle 
import preprocessor # this is the module/wrapper we created. 
import LDA_extractor
import numpy as np 
import pandas as pd

from preprocessor import spacy_preprocessor 
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases
from gensim.test.utils import common_texts, get_tmpfile
from IPython.display import display, HTML

# set up preprocessor
preproc = spacy_preprocessor()


In [74]:
# TESTS 
ex = "This is a sentence. We want to include some stuff. Help! There is not time." 
sent_tokd = preproc.sent_tokenizer(ex) 
print(sent_tokd)

# Word2Vec()
# model = Word2Vec(sent, min_count=1,size= 50,workers=3, window =3, sg = 1)

# Want to have this format: 
input_data = [['This', 'is', 'sentence', 'one'], 
              ['And', 'this', 'is', 'sentence', 'two']]

['This is a sentence.', 'We want to include some stuff.', 'Help!', 'There is not time.']


In [66]:
PATH = "../../data_raw/corpus.pkl"
with open(PATH,'rb') as file: 
    corpus = pickle.load(file) 
    
"""
corpus = {
    article_set_id: {
        'articles': [
            ...list of articles in the set
        ],
        'summaries': [
            ...list of human generated summaries
        ]
    },
    ...
}
"""

texts = [] 
for article_set_id in corpus.keys(): 
    for article in corpus[article_set_id]['articles']: 
        texts+= [article] 
    
print(type(texts))
print(type(texts[0]))
print(texts[0])

<class 'list'>
<class 'str'>
Cambodian leader Hun Sen on Friday rejected opposition parties' demands for talks outside the country, accusing them of trying to ``internationalize'' the political crisis. Government and opposition parties have asked King Norodom Sihanouk to host a summit meeting after a series of post-election negotiations between the two opposition groups and Hun Sen's party to form a new government failed. Opposition leaders Prince Norodom Ranariddh and Sam Rainsy, citing Hun Sen's threats to arrest opposition figures after two alleged attempts on his life, said they could not negotiate freely in Cambodia and called for talks at Sihanouk's residence in Beijing. Hun Sen, however, rejected that. ``I would like to make it clear that all meetings related to Cambodian affairs must be conducted in the Kingdom of Cambodia,'' Hun Sen told reporters after a Cabinet meeting on Friday. ``No-one should internationalize Cambodian affairs. It is detrimental to the sovereignty of Camb

Now we have a list of strings, where each string is a text. We want to tokenize by sentences, and then have one huge list of sentences. 

In [82]:
sentences = [] 

# Iterate through every text
for text in texts:  
    text_sents = preproc.sent_tokenizer(text)  # sent_tokenize that text 
    sentences += text_sents # append to the collection 

print(len(sentences)) # we have these many sentences 
print(sentences[0])

13270
Cambodian leader Hun Sen on Friday rejected opposition parties' demands for talks outside the country, accusing them of trying to ``internationalize'' the political crisis.


The next step is to clean all of these sentences. We use the spacy_preprocessor from our preprocessor module. 

In [91]:
clean_corpus = preproc.preprocess_texts(
                            sentences, 
                            tags = ["DET","NUM","SPACE"], 
                            custom_filter = [], 
                            remove_punct = True, 
                            regex_pattern = '', 
                            stem=False, 
                            lemmatize=False, 
                            join=False, 
                            min_len=1)

In [92]:
print(clean_corpus[0:2])

[['cambodian', 'leader', 'hun', 'sen', 'friday', 'rejected', 'opposition', 'parties', 'demands', 'talks', 'outside', 'country', 'accusing', 'trying', 'internationalize', 'political', 'crisis'], ['government', 'opposition', 'parties', 'asked', 'king', 'norodom', 'sihanouk', 'host', 'summit', 'meeting', 'series', 'post', 'election', 'negotiations', 'opposition', 'groups', 'hun', 'sen', 'party', 'form', 'new', 'government', 'failed']]


Now we train the actual thing using word2vec.

## Skip-gram training 

Train the model using word2vec. 

In [95]:
skip_gram = Word2Vec(clean_corpus,
                     min_count=0, # ignore words that appear less than this 
                     size= 400, # dimensionality of vectors
                     workers=-1, # number of workers 
                     window=10, # window size 
                     iter=100, # iteration 
                     hs = 1, # hierarchical softmax 
                     negative=10,  # negative sampling? 
                     sg = 1)

In [98]:
skip_gram.most_similar("parties")

  """Entry point for launching an IPython kernel.


[('cameras', 0.1896907091140747),
 ('scrutiny', 0.18311579525470734),
 ('shooter', 0.17772801220417023),
 ('bodyguards', 0.17234481871128082),
 ('taint', 0.17103847861289978),
 ('role', 0.16631092131137848),
 ('spared', 0.16630259156227112),
 ('badly', 0.16278895735740662),
 ('brand', 0.15966100990772247),
 ('wrestled', 0.15910710394382477)]