Text summarization has many applications, such as search engines, opinion extraction, news, and report documents.

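Summarization techniques fall into two categories:

- Extractive: find the key information in the text and concatenate it to form the output
    - Key information identification and extraction
- Abstractive: generate words (possibly words not in the source) conditioned on the input text
    - Seq2Seq
    - Pointer Generator
    - Transformer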

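Project workflow
1. Data analysis
1. Data processing
1. Key-sentence extraction based on sentence embeddings
    - Similarity measure: cosine similarity
1. Smoothing for fluency
    - Smoothing by averaging over neighboring sentence embeddings
1. Title and keyword weighting
    - Give the title embedding a higher weight when computing similarity
    - Extract keywords with TextRank and weight them when computing sentence embeddings
    - Position-based weighting: the beginning and end of a paragraph are usually more important
1. Topic-relevance correction based on LDA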
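
The key-sentence step of the workflow can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: the names `cosine_similarity` and `rank_sentences` and the toy 3-dimensional vectors are hypothetical, and real sentence vectors would come from the embedding step.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_sentences(sentence_vectors, doc_vector, top_k=2):
    """Indices of the top_k sentences most similar to the document, in document order."""
    scored = [(cosine_similarity(v, doc_vector), i)
              for i, v in enumerate(sentence_vectors)]
    best = sorted(scored, reverse=True)[:top_k]
    return sorted(i for _, i in best)

# Toy "embeddings" (made-up values, for illustration only)
toy_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
toy_doc = [1.0, 0.1, 0.1]
print(rank_sentences(toy_vectors, toy_doc))
```

Returning the selected indices in document order keeps the extracted sentences in their original reading order.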

## Unsupervised extractive Summarization

#### data analysis

In [None]:
import pandas as pd


raw_data = pd.read_csv('sqlResult_1558435.csv', encoding='gb18030')
raw_data.head(3)

In [None]:
raw_data.feature[1]

In [None]:
raw_data.content[1]

In [None]:
raw_data.title[1]

In [None]:
raw_data.url[1]

In [None]:
raw_data.source[1]

In [None]:
raw_data.dtypes

In [None]:
raw_data.info()

In [None]:
pd.notnull(raw_data.content)

#### Filter out meaningless content

In [None]:
raw_data.content.apply(lambda x: len(str(x))).plot()

In [None]:
raw_data.content.apply(lambda x: len(str(x))).describe()

In [None]:
useless_index = []

for i, c in raw_data.content.items():
    if len(str(c)) <= 120:
        useless_index.append(i)
        if len(str(c)) > 100:
            print(raw_data.url[i], '||', raw_data.content[i])


# raw_data.content.isnull()

In [None]:
len(useless_index)

In [None]:
useless_index_long = []

for i, c in raw_data.content.items():
    if len(str(c)) <= 30000:
        useless_index_long.append(i)
        if len(str(c)) > 10000:
            print(raw_data.url[i], 'Len: ',len(str(c)))
            print('index (in pandas):', i)
            print(raw_data.content[i][:500])
            print('=========')
            print(raw_data.content[i][-500:])
            print('#########')
            print()

In [None]:
useless_index.extend([3117, 6221, 10052, 27862, 48328, 62823, 76116, 79555, 82780, 84244])

Finding: news items whose content repeatedly contains "外代二线" do not need summarizing and should be removed during processing.
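
A compact way to express this filter is to count occurrences of the marker directly. The sketch below is an assumption-labeled alternative, not the notebook's own code: the helper name `mentions_marker_often` and the threshold parameter `min_hits` are hypothetical. Counting at least 5 occurrences is equivalent to a `split` on the marker producing at least 6 pieces.

```python
def mentions_marker_often(text, marker="外代二线", min_hits=5):
    """Return True when `marker` occurs at least `min_hits` times in `text`.

    str(text) mirrors the notebook's handling of missing (NaN) content.
    """
    return str(text).count(marker) >= min_hits

print(mentions_marker_often("外代二线 图片 " * 5))
print(mentions_marker_often("外代二线 图片 " * 4))
```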

In [None]:
useless_index_words = []

def deal_some_words():
    t = 0
    for i, c in raw_data.content.items():
        t += 1
        split_res = str(c).split('外代二线')
        if len(split_res) >= 6:
            useless_index_words.append(i)
            if t % 5 == 0:
                print(raw_data.url[i], 'Len: ',len(str(c)))
                print('index (in pandas):', i)
                print(raw_data.content[i][:500])
                print('=========')
                print(raw_data.content[i][-500:])
                print('#########')
                print()
                
deal_some_words()

In [None]:
len(useless_index_words)

In [None]:
for i in useless_index_words:
    if i not in useless_index:
        useless_index.append(i)

In [None]:
data = raw_data.drop(pd.Index(useless_index))

In [None]:
data.info()

In [None]:
data.content.apply(lambda x: len(str(x))).plot()

In [None]:
data.to_csv('clean_data_len_gt_120.csv', encoding='utf-8')

In [None]:
del raw_data

### Model

#### Sentence splitting

In [None]:
import pandas as pd

data = pd.read_csv('clean_data_len_gt_120.csv', encoding='utf-8')

In [None]:
from pyltp import SentenceSplitter  # deprecated


def split_sentence(doc):
#     doc = doc.strip().replace(u'\u3000', u'').replace(u'\\n', u'。').replace(u'(。)+', u'。').replace(' ', '')
    return [sent for sent in SentenceSplitter.split(doc) if len(sent)>5]



import re
from functools import reduce


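def split_to_sentence(doc, min_len=6):
    """Custom paragraph and sentence splitting.
    
    return:
        list of lists: one list of sentences per paragraph; the indices can be used later to compute position features
    """
#     pattern = re.compile(".*?[。?？!！]")  # non-greedy match of the sentence text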
    
    paragraph_gen = split_to_paragraph(doc)
    doc_content = []
    for para in paragraph_gen:
        if para is None:
            continue
        elif len(para) <= min_len:
            continue
        
        doc_content.append(split_sentence(para))
#         elif para.strip()[-1] in '。?？!！"“”':
#             sent_of_para = re.findall(pattern, para)
#             doc_content.append(sent_of_para)
#         else: 
#             doc_content.append([para])
            
    return doc_content


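def split_to_paragraph(doc):
    """Split into paragraphs so that sentences near the start and end of a paragraph can be identified; sentence position features are produced separately.
    
    return:
        generator over the filtered split results
    """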
    pattern = re.compile(r"(\r\n\u3000\u3000)|(\r\n)|(\u3000\u3000)|(\\n)")
    res = re.split(pattern, doc)
    for i in res:
        if i and len(i) > 5:
            yield i
            

In [None]:
            
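def split_to_sentence(doc, min_len=6, use_re=False):
    """Custom paragraph and sentence splitting.

    return:
        list of lists: one list of sentences per paragraph; the indices can be used later to compute position features
    """
    if use_re:
        pattern = re.compile(".*?[。?？!！]")  # non-greedy match of the sentence text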

    paragraph_gen = split_to_paragraph(doc)
    doc_content = []
    for para in paragraph_gen:
        if para is None:
            continue
        elif len(para) <= min_len:
            continue

        if not use_re:
            doc_content.append(split_sentence(para))
        else:
            if para.strip()[-1] in '。?？!！"”':
                sent_of_para = re.findall(pattern, para)
                doc_content.append(sent_of_para)
            else:
                doc_content.append([para])

    return doc_content
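
If `pyltp` is not available, a regex-only splitter can stand in for `split_sentence`. The following is a sketch under that assumption: the name `split_sentences_re` and the exact punctuation set are hypothetical choices, and it keeps a closing quote attached to the sentence it ends.

```python
import re

# One or more sentence-ending marks, optionally followed by closing quotes.
_SENT_END = re.compile(r'([。！？!?]+[”"’』」]*)')

def split_sentences_re(paragraph, min_len=6):
    """Split a paragraph into sentences, dropping pieces shorter than min_len."""
    # With one capturing group, re.split alternates text and delimiter:
    # [text, delim, text, delim, ..., trailing text]
    parts = _SENT_END.split(paragraph)
    sentences = [parts[i] + parts[i + 1] for i in range(0, len(parts) - 1, 2)]
    sentences.append(parts[-1])  # text after the last delimiter (may be empty)
    return [s.strip() for s in sentences if len(s.strip()) >= min_len]

sample = '今天天气非常不错。我们一起去公园散步吧！好。他说：“你真的要走了吗？”然后就离开了这里'
for sentence in split_sentences_re(sample):
    print(sentence)
```

The `min_len=6` default matches the `len(sent) > 5` filter used in `split_sentence`.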

In [None]:
split_to_sentence(data.content[1312])

#### Word segmentation

In [None]:
import jieba

from pyhanlp import *

A comparison of several mainstream tokenizers

In [None]:
def cut(string): return ' '.join(jieba.cut(string))

In [None]:
s = '从大的环境上来看，市场目前本身不具备大面积和大空间的反弹基础，因为目前无论是从宏观面、货币基本面或者从国际经济和政治的角度来看，都不具备这样的条件，所以反应到市场中来，只能是结构性、局部性的投机性机会。而最近半个月以来，市场的走势也确实符合局部性、结构性投机的走势。'
cut(s)

In [None]:
print(HanLP.segment(s))

# Define the tokenizer before its first use
StandardTokenizer = JClass('com.hankcs.hanlp.tokenizer.StandardTokenizer')
print(StandardTokenizer.segment(data.content[1]))
print(StandardTokenizer.segment(s))

# With named entity recognition
NLPTokenizer = JClass('com.hankcs.hanlp.tokenizer.NLPTokenizer')
print(NLPTokenizer.segment(s))

In [None]:
string = '''网易娱乐7月21日报道 林肯公园主唱查斯特·贝宁顿Chester Bennington于今天早上，在洛杉矶帕洛斯弗迪斯的一个私人庄园自缢身亡，年仅41岁。此消息已得到洛杉矶警方证实。
　　洛杉矶警方透露，Chester的家人正在外地度假，Chester独自在家，上吊地点是家里的二楼。一说是一名音乐公司工作人员来家里找他时发现了尸体，也有人称是佣人最早发现其死亡。
　　林肯公园另一位主唱麦克·信田确认了Chester Bennington自杀属实，并对此感到震惊和心痛，称稍后官方会发布声明。Chester昨天还在推特上转发了一条关于曼哈顿垃圾山的新闻。粉丝们纷纷在该推文下留言，不相信Chester已经走了。
　　外媒猜测，Chester选择在7月20日自杀的原因跟他极其要好的朋友、Soundgarden(声音花园)乐队以及Audioslave乐队主唱Chris Cornell有关，因为7月20日是Chris Cornell的诞辰。而Chris Cornell于今年5月17日上吊自杀，享年52岁。Chris去世后，Chester还为他写下悼文。
　　对于Chester的自杀，亲友表示震惊但不意外，因为Chester曾经透露过想自杀的念头，他曾表示自己童年时被虐待，导致他医生无法走出阴影，也导致他长期酗酒和嗑药来疗伤。目前，洛杉矶警方仍在调查Chester的死因。
　　据悉，Chester与毒品和酒精斗争多年，年幼时期曾被成年男子性侵，导致常有轻生念头。Chester生前有过2段婚姻，育有6个孩子。
　　林肯公园在今年五月发行了新专辑《多一丝曙光One More Light》，成为他们第五张登顶Billboard排行榜的专辑。而昨晚刚刚发布新单《Talking To Myself》MV。'''

print(HanLP.extractSummary(string, 9))

In [None]:
from tqdm import tqdm, tqdm_notebook
import itertools
from itertools import chain

In [None]:
StandardTokenizer = JClass('com.hankcs.hanlp.tokenizer.StandardTokenizer')

def segment(content):
    """Building on split_to_sentence, produce the tokenized text (tokens joined by spaces).

    Uses HanLP's StandardTokenizer.
    """
    total_tokens = []
    sents = split_to_sentence(content)
    for sent in chain.from_iterable(sents):
        tokens = [item.word for item in StandardTokenizer.segment(sent)]
        total_tokens.extend(tokens)
    return ' '.join(total_tokens)

In [None]:
test = data[: 20].copy()
test.content[2]
test['tokens'] = test.content.apply(segment)  # store the result so test.tokens exists

In [None]:
test.tokens[1]

In [None]:
tqdm.pandas()  # registers progress_apply on pandas objects
data['tokens'] = data.content.progress_apply(segment)

In [None]:
data.tokens[1]

In [None]:
data.to_csv('clean_data_len_gt_120.csv')

fasttext

In [None]:
import time

In [None]:
with open('tokens.txt', 'w', encoding='utf-8') as f:
    for c in tqdm_notebook(data.tokens):
        f.write(c)
        f.write('\n')

In [None]:
from gensim.models import FastText
from gensim.models.word2vec import LineSentence

corpus_file = 'tokens.txt'  # absolute path to corpus
model = FastText(window=5, size=200,  min_count=1, workers=2)  # `size` is the gensim 3.x name; gensim 4+ uses `vector_size`
model.build_vocab(corpus_file=corpus_file)  # scan over corpus to build the vocabulary

total_words = model.corpus_total_words  # number of words in the corpus
model.train(corpus_file=corpus_file, total_words=total_words, epochs=5)

In [None]:
%%time
# The corpus is too small, so fastText trained in unsupervised mode still performs poorly
from gensim.models import KeyedVectors

word_vec = KeyedVectors.load_word2vec_format('sgns.target.word-character.char1-2.dynwin5.thr10.neg5.dim300.iter5')

In [None]:
word_vec.similar_by_word('伤心')

#### Word frequency and probability

In [None]:
from tqdm import tqdm, tqdm_notebook

def count_gen():
    corpus_dict = []
    for c in tqdm_notebook(data.tokens):
        corpus_dict.extend(c.split())
    return corpus_dict

In [None]:
corpus_dict = count_gen()

In [None]:
from collections import Counter

total_counter = Counter(corpus_dict)

length = len(corpus_dict)

frequence = {w: c/length for w, c in total_counter.items()}
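
The `frequence` dictionary serves as the unigram probability p(w) that the sentence embedding step later turns into weights of the form a / (a + p(w)). The sketch below illustrates that relationship on a toy corpus. The helper names `unigram_probabilities` and `sif_weight` and the three toy documents are hypothetical.

```python
from collections import Counter

def unigram_probabilities(tokenized_docs):
    """Map each token to its relative frequency over space-separated documents."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def sif_weight(prob, a=1e-4):
    """Down-weight frequent words: the weight shrinks as the probability grows."""
    return a / (a + prob)

docs = ["the cat sat", "the dog sat", "the end"]
probs = unigram_probabilities(docs)
for word in ("the", "sat", "cat"):
    print(word, probs[word], sif_weight(probs[word]))
```

Frequent words such as "the" receive a much smaller weight than rare ones, which is the reason for computing corpus-level frequencies first.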

In [None]:
import pickle

In [None]:
with open('frequence.bin', 'wb') as f:
    pickle.dump(frequence, f)

In [None]:
with open('frequence.bin', 'rb') as f:
    frequence = pickle.load(f, encoding='utf-8')

In [None]:
occurences_frequences = sorted(list(frequence.values()), reverse=True)

In [None]:
occurences_frequences[: 10]

In [None]:
occurences_frequences[-10:]

#### TextRank keyword and key-sentence extraction

In [None]:
tokenizer = JClass('com.hankcs.hanlp.tokenizer.StandardTokenizer')

In [None]:
data.content[0]

In [None]:
sent_list = ["除小米手机6等15款机型外，其余机型已暂停更新发布，以确保工程师可以集中全部精力进行系统优化工作。", "有人猜测这也是将精力主要用到MIUI 9的研发之中。", ""]

In [None]:
# Generates test data (relies on `stopwords` and `allow_speech_tags`, which are defined in later cells)
sent_word_list = []
for sent in sent_list[:-1]:
    tmp = []
    for item in tokenizer.segment(sent):
        if str(item.nature) in allow_speech_tags and str(item.word) not in stopwords:
            tmp.append(item.word)
    sent_word_list.append(tmp)

In [None]:
def stopwordslist():
    """Build the stopword set."""
    with open('stopwords.txt', encoding='UTF-8') as f:
        stopwords = {line.strip() for line in f}
    stopwords.add('\u3000')
    return stopwords
    
stopwords = stopwordslist()

In [None]:
# coding:utf-8

import numpy as np

from collections import defaultdict, Counter
from itertools import chain
from pyhanlp import *
from math import log
# from utils import split_to_sentence

class TextRank:
    def __init__(self, stopwords, allowed_pos):
        self.stopwords = stopwords
        self.allowed_pos = frozenset(allowed_pos)

        self.tokenizer = JClass('com.hankcs.hanlp.tokenizer.StandardTokenizer')

    def process_input(self, doc_str, case='keyword'):
        "Process the input document. Output format: [['sent', 'one', 'words'],['sent', 'two', 'words']]"
        self.sent_of_words = []
        sent_list = split_to_sentence(doc_str)
        for sent in chain.from_iterable(sent_list):
            tmp = []
            for item in self.tokenizer.segment(sent):
                if case == 'keyword':
                    tmp.append(item)
                elif case == 'keysentence':
                    tmp.append(str(item.word))
            self.sent_of_words.append(tmp)

    def get_keywords(self, doc_str, num=10, min_len=2, span=5):
        self.process_input(doc_str, case='keyword')
        # find candidate words
        count_dic = defaultdict(int)
        word_set = set()
        for sent in self.sent_of_words:
            for i, w_item in enumerate(sent):
                if self.filte_words(w_item, min_len):
                    word = str(w_item.word)
                    word_set.add(word)
                    for j in range(i+1, i+span):
                        if j >= len(sent): break
                        if not self.filte_words(sent[j], min_len):
                            continue
                        word_j = str(sent[j].word)
                        word_set.add(word_j)
                        count_dic[(word, word_j)] += 1
                        count_dic[(word_j, word)] += 1
        word2id = {word: i for i, word in enumerate(list(word_set))}
        id2word = {i: word for word, i in word2id.items()}

        # build the co-occurrence matrix
        num_of_nodes = len(word2id)
        weight_M = np.zeros((num_of_nodes, num_of_nodes))
        for (wi, wj), weight in count_dic.items():
            i = word2id[wi]
            j = word2id[wj]
            weight_M[i, j] = weight
        
        weight_M = np.nan_to_num(weight_M / np.linalg.norm(weight_M, ord=1, axis=0, keepdims=True))
        # solve with PageRank
        textrank_v = self.page_rank(weight_M)
        result = sorted([(id2word[i], value) for i, value in enumerate(textrank_v)],
                        key=lambda x: x[1],
                        reverse=True)
        return result[: num]

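    def get_keysentences(self, doc_str, num=6, min_len=5):
        """Identical sentences almost never repeat within a passage, so PageRank based on co-occurrence alone does not work for sentences.
        BM25 is introduced to compute the association weight between sentences. Note: BM25 was originally designed for information retrieval, to score the similarity between a query sentence and a document.
        """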
        self.process_input(doc_str, case='keysentence')
        total_sent = len(self.sent_of_words)

        weight_M = np.zeros((total_sent, total_sent))
        for i in range(total_sent):
            sent_i = self.sent_of_words[i]
            for j in range(total_sent):
                if i == j: continue
                sent_j = self.sent_of_words[j]
                # row i, column j of the weight matrix
                weight_M[i, j] = self.sent_corelation_func(sent_i, sent_j)
        
        weight_M = np.nan_to_num(weight_M / np.linalg.norm(weight_M, ord=1, axis=0, keepdims=True))
        
        sent_para = split_to_sentence(doc_str)
        ps_weight = get_position_weight(sent_para)
        textrank_v = self.page_rank(weight_M) * np.array(ps_weight)
        print(textrank_v)
        result_id = sorted([(idx, value) for idx, value in enumerate(textrank_v)],
                           key=lambda x: x[1],
                           reverse=True
                           )
        count = 0
        result_sent = []
        for (i, value) in result_id:
            if count >= num:
                break
            sent = ''.join(self.sent_of_words[i])
            if len(sent) <= min_len:
                continue
            result_sent.append((sent, value, i))
            count += 1
        
        result_sent = sorted(result_sent, key=lambda x: x[2])
        return result_sent


    def get_tf(self, sent_i, sent_j):
        """Compute the BM25 term frequency. Sentences come from the preprocessed sent_of_words list."""
        freq = {}
        sent_i_counts = Counter(sent_i)
        # for each word of sentence j, compute its tf in sentence i
        for w in sent_j:
            # if not self.filte_words(w_item):
            #     continue
            if w in sent_i_counts:
                freq[w] = sent_i_counts[w]
            else:
                freq[w] = 0
        total = len(sent_i)
        return {word: count / total for word, count in freq.items()}

    def get_idf(self):
        """Compute the inverse document frequency. Since sentence similarity is computed here, this is an inverse sentence frequency."""
        total_sent = len(self.sent_of_words) + 1 # assume one extra sentence that contains every word
        avg_len = 0
        doc_freq = {}
        for sent in self.sent_of_words:
            avg_len += len(sent)
            words = list({w for w in sent})
            for word in words:
                # assume one extra sentence that contains every word
                doc_freq[word] = doc_freq.setdefault(word, 0) + 1
        avg_len /= total_sent
        # as in the sklearn implementation
        idf = {word: log(total_sent / df) + 1 for word, df in doc_freq.items()}
        return idf, avg_len

    def filte_words(self, w_item, min_len=2):
        word = str(w_item.word)
        pos = str(w_item.nature)
        return (pos in self.allowed_pos and word not in self.stopwords
                and len(word) >= min_len)

    def sent_corelation_func(self, sent_i, sent_j, k1=1.5, b=0.75):
        """Compute BM25.

        sent_i: the sentence compared against the query, iterated over the document
        sent_j: the query sentence
        """
        idf, avg_len = self.get_idf()
        tf = self.get_tf(sent_i, sent_j)

        K = k1 * (1 - b + b * len(sent_i) / avg_len)
        bm25 = 0
        for j_word in sent_j:
            bm25 += idf[j_word] * tf[j_word] * (k1 + 1) / (tf[j_word] + K)
        return bm25

    @staticmethod
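    def page_rank(weight_M, iterations=100, d=0.85):
        """
        weight_M: for TextRank, the weight matrix of the qualifying edges collected by sliding a window over the document.
                  In PageRank, row i, column j holds the link weight from node j to node i.
                  TextRank uses an undirected graph, so the entry is just the co-occurrence weight of the two nodes.
        d: damping factor, which keeps the walk from getting trapped in a region with no outgoing links
        """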
        N = weight_M.shape[1]
        v = np.random.rand(N, 1)
        v = v / np.linalg.norm(v, 1)
        M_hat = (d * weight_M + (1 - d) / N)
        for i in range(iterations):
            v = M_hat @ v
        return v.ravel()
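
The power iteration behind `page_rank` can be illustrated without NumPy. This is a minimal sketch, not the class's implementation: the function name `page_rank_sketch` and the toy graph are hypothetical, columns are normalized inside the function, and the start vector is uniform rather than random. Because the score vector sums to 1, adding `(1 - d) / n` to every entry is equivalent to the `(1 - d) / N` term in `M_hat`.

```python
def page_rank_sketch(weights, d=0.85, iterations=100):
    """weights[i][j] is the edge weight from node j to node i."""
    n = len(weights)
    col_sums = [sum(weights[i][j] for i in range(n)) for j in range(n)]
    v = [1.0 / n] * n  # uniform start, sums to 1
    for _ in range(iterations):
        new_v = []
        for i in range(n):
            # share of each neighbour's score that flows into node i
            inflow = sum(weights[i][j] / col_sums[j] * v[j]
                         for j in range(n) if col_sums[j] > 0)
            new_v.append((1 - d) / n + d * inflow)
        v = new_v
    return v

# Undirected toy graph: node 0 is linked to nodes 1 and 2.
graph = [[0, 1, 1],
         [1, 0, 0],
         [1, 0, 0]]
scores = page_rank_sketch(graph)
print(scores)
```

Node 0, which is connected to both other nodes, ends up with the highest score, while nodes 1 and 2 tie.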


In [None]:
allow_speech_tags = set('nz ni ntc j ntcb nt nhm nic nn t g n nnd ntch nit gb gbc nb nnt nba nr an gc nbc nr1 gg nbp nr2 gi nf nrf gm ng nrj gp nh ns nhd nsf i v vl vi vd nl'.split())

# jieba
# allow_speech_tags =  ['an', 'i', 'j', 'l', 'n', 'nr', 'nrfg', 'ns', 'nt', 'nz', 't', 'v', 'vd', 'vn', 'eng']

In [None]:
textrank = TextRank(stopwords, allow_speech_tags)

In [None]:
data.content[11]

In [None]:
textrank.get_keywords(data.content[11])

In [None]:
textrank.get_keysentences(data.content[13])

In [None]:
sorted(textrank.get_keywords(data.content[11]), key=lambda x: x[1], reverse=True)

In [None]:
for i in HanLP.extractKeyword(data.content[11], 10):
    print(i)

In [None]:
string ='''据国外媒体报道，据英国《卫报》报道，热带飓风“阿加莎”31日席卷中美洲，它带来的倾盆大雨已经夺去100人的性命。在这场飓风的影响下，危地马拉首都危地马拉城出现一个深达60米的塌陷洞，据说有一栋3层建筑坠入洞中。    受2010年首个太平洋热带风暴影响，危地马拉城积聚了1米多深的雨水，这场风暴还影响到萨尔瓦多和洪都拉斯。据报告上称，目前危地马拉至少已有113人丧生，大约有50人失踪，营救队员正在一片瓦砾中进行搜救。    这个直径30米的塌陷洞位于危地马拉北部地区。当地居民称，雨水和排水系统不完善导致地面塌陷。当地报道表示，在那座3层建筑物坠入地洞时，至少有1人丧生。2007年，这一地区出现类似塌陷洞，当时有3人丧生。    危地马拉是受“阿加莎”影响最严重的一个国家，经证实，该地目前死亡人数已达92人，搜救人员进入偏远农村后，这一数字有可能还会继续上升。有大约10万人被迫撤离家园。警方称，萨尔瓦多有9人死亡，洪都拉斯有12人丧生。    阿马蒂特兰的卡尔洛塔·拉莫斯站在几乎被淤泥淹没的房屋前悲伤地说：“没有人可以帮助我。我眼睁睁看着雨水冲走了一切。'''

from jieba import analyse

textrank = analyse.textrank

text = string

allowPOS = ('an', 'i', 'j', 'l', 'n', 'nr', 'nrfg', 'ns', 'nt', 'nz', 't', 'v', 'vd', 'vn', 'eng')

print("keywords by textrank:")
keywords = textrank(
    text,
    topK=10,
    withWeight=True,
    allowPOS=allowPOS,
    withFlag=False)

words = [(keyword, w) for keyword, w in keywords if w > 0.1]
print(words)

#### sentence embedding

In [None]:
import numpy as np
from scipy.spatial.distance import cosine
from pyhanlp import *
from gensim.models import KeyedVectors

In [None]:
word_vec = KeyedVectors.load_word2vec_format('sgns.target.word-character.char1-2.dynwin5.thr10.neg5.dim300.iter5')

In [None]:
tokenizer = JClass('com.hankcs.hanlp.tokenizer.StandardTokenizer')

In [None]:
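def cal_sentences_vec_mat(sent_list, prob_dict, param_a=0.0001):
    """Compute sentence vectors. Modified from the original paper so that the semantic model also includes the whole document and the title.
    From the paper:
    A Simple but Tough-to-Beat Baseline for Sentence Embeddings. ICLR 2017
    
    sent_list: sentences of the document to be processed, list; the last element is the title;
    prob_dict: token probabilities in the corpus, dict;
    param_a: a parameter value found to work well in the paper's experiments, 1e-3 ~ 1e-5;
    
    return: 
        matrix--matrix of shape (vector_dim, sentence_num + 1);
                each column is a sentence vector. The extra column is the doc vector.
                The last of the sentence_num sentences is the title
        title_vector--vector of the title
        doc_vector--vector of the whole document
    """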
    row_size = word_vec.vector_size
    col_size = len(sent_list)
    
    doc_vector = np.zeros(row_size)
    matrix = np.zeros((row_size, col_size + 1))  # +1 for the vector of the whole document
    
    default_p = max(prob_dict.values())
    doc_len = 0
    for i, sentence in enumerate(sent_list):
        sentence = tokenizer.segment(sentence)
        sent_len = len(sentence)
        doc_len += 1
        
        sent_vector = matrix[:, i]
        for item in sentence:  # compute sent_vector for sentence i
            token = str(item.word)
            pw = prob_dict.setdefault(token, default_p)
            weight = param_a / (param_a + pw)
            try:
                word_vector = np.array(word_vec.get_vector(token))
                sent_vector += weight * word_vector
            except Exception:
                continue
        
        matrix[:, i] = sent_vector / sent_len
        doc_vector += matrix[:, i]
    matrix[:, -1] = doc_vector / doc_len
    
    print(matrix)
    matrix = np.nan_to_num(matrix)
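    # Use PCA to find the part shared by all sentences (the first principal component), then subtract it
    U, s, Vh = np.linalg.svd(matrix)  # s is in descending order by default
    u = U[:, 0]  # first principal component
    matrix -= np.outer(u, u.T) @ matrix  # subtract each sent_vector's projection onto the first principal component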
    
    doc_vector = matrix[:, -1]
    title_vector = matrix[:, -2]
    return matrix, title_vector, doc_vector
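
For reference, the computation in `cal_sentences_vec_mat` can be written as the two steps below. The symbols are introduced here for illustration: $p(w)$ is the corpus probability from `prob_dict`, $a$ is `param_a`, $v_w$ is the word vector of token $w$, $|s|$ is the number of tokens in sentence $s$, and $u$ is the first left singular vector of the matrix whose columns are the sentence vectors plus the document vector.

$$
v_s = \frac{1}{|s|} \sum_{w \in s} \frac{a}{a + p(w)}\, v_w,
\qquad
v_s \leftarrow v_s - u u^{\top} v_s
$$

The first step is a frequency-weighted average of word vectors. The second removes the direction that all columns share, which tends to carry little sentence-specific information.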

In [None]:
u = np.array([1,2,3]).T
matrix = np.zeros((3, 3))
matrix[:, -1] = np.array([1,1,1])

In [None]:
np.outer(u, u.T) @ matrix

In [None]:
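def get_position_weight(sent_para):
    """Give extra weight to the beginning and the end.
    
    return:
        position weights from the first sentence to the last, list
    """
    pos_sent_weight = []
    first_para_flag = True
    
    for i, para in enumerate(sent_para):
        if len(para) > 1:
            # first and last sentence of each paragraph
            tmp = [1.1] + [1. for _ in range(len(para)-2)] + [1.08] 
        else:
            # one weight per sentence, so an empty paragraph contributes nothing
            tmp = [1.] * len(para)
        
        # first paragraph
        if first_para_flag:
            tmp = [1.1 * w for w in tmp]
            first_para_flag = False
        # last paragraph
        elif i == len(sent_para) - 1 and para and len(para[-1]) >= 10:
            tmp = [1.08 * w for w in tmp]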
        
        pos_sent_weight.extend(tmp)
    return pos_sent_weight


def neighbor_smooth():
    """Neighbor smoothing (not implemented)."""
    # take neighboring sentences into account when computing the embedding
    pass


# def get_title_info(title, prob_dict, param_a=0.0001):
#     """Title information"""
#     tokens = [item.word for item in StandardTokenizer.segment(title)]
#     size = word_vec.vector_size
    
#     default_p = max(prob_dict.values())
#     title_vec = np.zeros(size)
#     print(tokens)
#     for word in tokens:
#         pw = prob_dict.setdefault(word, default_p)
#         weight = param_a / (param_a + pw)
#         try:
#             word_vector = np.array(word_vec.get_vector(word))
#             title_vec += weight * word_vector
#         except Exception:
#             continue
#     title_vec /= len(tokens)
#     return title_vec, tokens


def get_keywords(content):
    """Increase the weight of sentences that contain keywords."""
    # get keywords with TextRank
    # the weighting is implemented in get_position_weight
    return HanLP.extractKeyword(content, 5)



from gensim import corpora, models, similarities
    
lda = models.LdaModel.load('lda_model.bin')
dictionary = corpora.Dictionary.load('lda_dictionary.bin')

num_topics = 10
topic_words_dist = []
for topicid in range(num_topics):
    topic_words = [w for w, _ in lda.show_topic(topicid, topn=10)]
    topic_words_dist.append(topic_words)


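def get_topic_distribution(sent_para):
    """Compare each sentence against the inferred topics: besides the whole text, the topics are taken into account as well.
    Uses the LDA topic model to obtain the topic distribution.
    
    return:
        topic_dist: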
            format--[(1, 0.018213129),
                    (2, 0.06460305),
                    (3, 0.114253126),
                    (5, 0.21796304),
                    (6, 0.03961128),
                    (9, 0.5442903)]
    """
    tokens = segment(sent_para, stopwords)
    bow_doc = dictionary.doc2bow(tokens)
    topic_dist = lda.get_document_topics(bow_doc)
    
    # combine the topic distribution with each topic's word distribution to get the required topic-word distribution
    return topic_dist


def cal_topic_embedding(sent_para, prob_dict, param_a=0.0001):
    """Build a topic embedding from the topic distribution and each topic's word distribution.
    
    return:
        1-D vector of size vector_size
    """
    topic_dist = get_topic_distribution(sent_para)
    size = word_vec.vector_size
    
    default_p = max(prob_dict.values())
    topics_vector = np.zeros(size)
    while topic_dist:
        # weight by topic probability
        topicid, t_weight = topic_dist.pop()
        topic_words = topic_words_dist[topicid]
        
        # same scheme as the sentence embedding computation
        topic_vector = np.zeros(size)
        for word in topic_words:
            pw = prob_dict.setdefault(word, default_p)
            w_weight = param_a / (param_a + pw)
            try:
                word_vector = np.array(word_vec.get_vector(word))
                topic_vector += w_weight * word_vector
            except Exception:
                continue

        topics_vector += t_weight * topic_vector  # apply the topic weight
        
    topics_vector /= num_topics * 10  # each topic is represented by 10 words
        
    return topics_vector

In [None]:
def get_summary(doc, title, window):
    sent_para = split_to_sentence(doc)
    pos_sent_weight = get_position_weight(sent_para)
    
    sent_list = [sent.strip() for sent in chain.from_iterable(sent_para)]

    if not title: title = sent_list[0] + sent_list[1]
    sent_list.append(title)
    
#     print(sent_list)
    sent_vecs, title_vec, doc_vec = cal_sentences_vec_mat(sent_list, frequence)
#     print(sent_vecs)
#     print()
#     print(doc_vec)
#     print()
    
    # LDA topics abstracted from the document are another way of modeling semantics, so they are kept out of the sentence embedding computation
    topics_vec = cal_topic_embedding(sent_para, frequence)
#     print(topics_vec)
#     print()
    
#     title_vec, tokens = get_title_info(title, frequence)
#     print(title_vec)

#     keyword
    textrank = TextRank(stopwords, allow_speech_tags)
    keywords = textrank.get_keywords(doc)
    
    keysentence = textrank.get_keysentences(doc)
    
    print(keywords)
    
    scores = []
    print(len(sent_list))
    print(sent_vecs.shape)
    for i in range(sent_vecs.shape[1] - 2):
        # scipy's cosine() returns a distance, so 1 - distance gives the similarity
        sent_to_doc = (1 - cosine(sent_vecs[:, i], doc_vec)) * pos_sent_weight[i]
        sent_to_topic = 1 - cosine(sent_vecs[:, i], topics_vec)
        sent_to_title = 1 - cosine(sent_vecs[:, i], title_vec)
        
        score = sent_to_doc + sent_to_topic + sent_to_title
        
        # use a separate loop variable so the sentence index i is not overwritten
        for rank, (kw, _) in enumerate(keywords):
            if kw in sent_list[i]:
                # keywords are sorted by value, so the weight decreases with rank
                score *= (1 + 0.5 * (10 - rank * 0.5) / 10)

        scores.append(score)
    
#     print(keywords)
    
    # A sentence's importance depends on both its own importance and that of its neighboring sentences
    for i in range(window):
        scores.insert(0, scores[0])
        scores.append(scores[-1])
    weight = np.array([0.25, 0.5, 0.25])
    print(scores)
    scores = np.array(scores)
    score_smooth = [np.dot(scores[i - window: i + window + 1], weight) for i in range(window, len(sent_list) - 1 + window)]
    print()
    
    
    assert len(sent_list) - 1 == len(score_smooth)
    print(pos_sent_weight)
    print(score_smooth)
    sorted_idx = np.argsort(score_smooth)[-len(sent_list)//3: ]
    sent_ids = sorted(sorted_idx)
    for i in sent_ids:
        print(sent_list[i])
    
    print(''.join([sent_list[i] for i in sent_ids]))
    print(keysentence)
    
# get_summary(data.content[1312], data.title[1312], window=1)
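
The neighbor smoothing at the end of `get_summary` can be isolated as a small helper. This is a sketch in plain Python rather than the notebook's NumPy code: the name `smooth_scores` is hypothetical, and the padding repeats the first and last score, as the `scores.insert` / `scores.append` calls do for `window=1`.

```python
def smooth_scores(scores, kernel=(0.25, 0.5, 0.25)):
    """Smooth each score with its neighbours, repeating the edge values as padding."""
    half = len(kernel) // 2
    padded = [scores[0]] * half + list(scores) + [scores[-1]] * half
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(scores))]

print(smooth_scores([1.0, 0.0, 0.0, 4.0]))  # [0.75, 0.25, 1.0, 3.0]
```

A sentence next to a high-scoring one is pulled up, which favors contiguous runs of sentences and tends to make the extracted summary read more smoothly.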

In [None]:
string ='''据国外媒体报道，据英国《卫报》报道，热带飓风“阿加莎”31日席卷中美洲，它带来的倾盆大雨已经夺去100人的性命。在这场飓风的影响下，危地马拉首都危地马拉城出现一个深达60米的塌陷洞，据说有一栋3层建筑坠入洞中。    受2010年首个太平洋热带风暴影响，危地马拉城积聚了1米多深的雨水，这场风暴还影响到萨尔瓦多和洪都拉斯。据报告上称，目前危地马拉至少已有113人丧生，大约有50人失踪，营救队员正在一片瓦砾中进行搜救。    这个直径30米的塌陷洞位于危地马拉北部地区。当地居民称，雨水和排水系统不完善导致地面塌陷。当地报道表示，在那座3层建筑物坠入地洞时，至少有1人丧生。2007年，这一地区出现类似塌陷洞，当时有3人丧生。    危地马拉是受“阿加莎”影响最严重的一个国家，经证实，该地目前死亡人数已达92人，搜救人员进入偏远农村后，这一数字有可能还会继续上升。有大约10万人被迫撤离家园。警方称，萨尔瓦多有9人死亡，洪都拉斯有12人丧生。    阿马蒂特兰的卡尔洛塔·拉莫斯站在几乎被淤泥淹没的房屋前悲伤地说：“没有人可以帮助我。我眼睁睁看着雨水冲走了一切。'''
title = '''危地马拉受热带风暴影响出现60米深巨大陷坑'''
get_summary(string, title, window=1)

In [None]:
s = """网易娱乐7月21日报道 林肯公园主唱查斯特·贝宁顿Chester Bennington于今天早上，在洛杉矶帕洛斯弗迪斯的一个私人庄园自缢身亡，年仅41岁。此消息已得到洛杉矶警方证实。

　　洛杉矶警方透露，Chester的家人正在外地度假，Chester独自在家，上吊地点是家里的二楼。一说是一名音乐公司工作人员来家里找他时发现了尸体，也有人称是佣人最早发现其死亡。

　　林肯公园另一位主唱麦克·信田确认了Chester Bennington自杀属实，并对此感到震惊和心痛，称稍后官方会发布声明。Chester昨天还在推特上转发了一条关于曼哈顿垃圾山的新闻。粉丝们纷纷在该推文下留言，不相信Chester已经走了。
　　外媒猜测，Chester选择在7月20日自杀的原因跟他极其要好的朋友、Soundgarden(声音花园)乐队以及Audioslave乐队主唱Chris Cornell有关，因为7月20日是Chris Cornell的诞辰。而Chris Cornell于今年5月17日上吊自杀，享年52岁。Chris去世后，Chester还为他写下悼文。
　　对于Chester的自杀，亲友表示震惊但不意外，因为Chester曾经透露过想自杀的念头，他曾表示自己童年时被虐待，导致他医生无法走出阴影，也导致他长期酗酒和嗑药来疗伤。目前，洛杉矶警方仍在调查Chester的死因。
　　据悉，Chester与毒品和酒精斗争多年，年幼时期曾被成年男子性侵，导致常有轻生念头。Chester生前有过2段婚姻，育有6个孩子。
　　林肯公园在今年五月发行了新专辑《多一丝曙光One More Light》，成为他们第五张登顶Billboard排行榜的专辑。而昨晚刚刚发布新单《Talking To Myself》MV。"""
t = """林肯公园主唱查斯特·贝宁顿自缢身亡，年仅41岁"""
get_summary(s, t, window=1)

In [None]:
get_summary(data.content[13], data.title[13], window=1)

#### lda train

In [None]:
from itertools import chain

StandardTokenizer = JClass('com.hankcs.hanlp.tokenizer.StandardTokenizer')

def segment(sentences, stopwords):
    """Tokenize the output of split_to_sentence, using HanLP's StandardTokenizer.
    
    return:
        list of tokens for a doc.
    """
    total_tokens = []
    for sent in chain.from_iterable(sentences):
        tokens = [item.word for item in StandardTokenizer.segment(sent) \
                  if item.word not in stopwords]
        total_tokens.extend(tokens)
    return total_tokens

In [None]:
# train lda
from gensim import corpora, models, similarities

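# input format: [['this', 'is', 'doc', 'one'],
#                ['this', 'is', 'doc', 'two']]

# When computing sentence embeddings, consider whether stopwords should be removed
# The sentence embedding input differs from the LDA input, so the two are processed separately
# The TextRank input is similar to the LDA input.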

to_lda = []
for doc in data.tokens:
    tokens = [token for token in doc.split(' ') if token not in stopwords]
    to_lda.append(tokens)    

https://radimrehurek.com/gensim/models/ldamodel.html

In [None]:
# mapping between normalized words and their integer ids.
dictionary = corpora.Dictionary(to_lda)

# bag of words
corpus = [dictionary.doc2bow(text) for text in to_lda]
# LDA model (can be updated (trained) with new documents.)
# The topic count follows the news categories of the CCTV news site (12 classes); note that num_topics below is set to 10
lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, alpha='auto')

dictionary.save('lda_dictionary.bin')

lda.save('lda_model.bin')

In [None]:
test_tokens = segment(split_to_sentence(data.content[12]), stopwords)  # segment expects the per-paragraph sentence lists
test_bow = dictionary.doc2bow(test_tokens)

lda.get_document_topics(test_bow)

In [None]:
sum([i[1] for i in lda[test_bow]])

In [None]:
# Create a new corpus, made of previously unseen documents.
other_texts = [
    ['computer', 'time', 'graph'],
    ['survey', 'response', 'eps'],
    ['human', 'system', 'computer']
]
other_corpus = [dictionary.doc2bow(text) for text in other_texts]

unseen_doc = other_corpus[0]
vector = lda[unseen_doc]  # get topic probability distribution for a document

In [None]:
# Update the model by incrementally training on the new corpus
lda.update(other_corpus)
vector = lda[unseen_doc]

In [None]:
lda.show_topics(num_topics=10, num_words=15, log=False, formatted=True)

In [None]:
lda.show_topic(1, topn=20)

In [None]:
# Travel
lda.show_topic(1, topn=20)

In [None]:
# Economy
lda.show_topic(2, topn=20)

In [None]:
# Education
lda.show_topic(3, topn=20)

In [None]:
# International trade
lda.show_topic(4, topn=20)

In [None]:
# Politics
lda.show_topic(5, topn=20)

In [None]:
# Culture
lda.show_topic(6, topn=20)

In [None]:
# International security
lda.show_topic(7, topn=20)

In [None]:
# Government policy
lda.show_topic(8, topn=20)

In [None]:
# Sports
lda.show_topic(9, topn=20)

In [None]:
# Duplicate topic
lda.show_topic(10, topn=20)

In [None]:
# Duplicate topic
lda.show_topic(11, topn=20)

In [None]:
# lda.print_topics(num_topics=20, num_words=10)
# lda.print_topic(topicno, topn=10)