## Item Profile

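Keywords are extracted with TFIDF and TextRank, and the words selected by both methods become the item's profile tags. The TFIDF keywords are also mapped to word embeddings and averaged; nearest-neighbor search over the resulting vectors can then drive related-item recommendation.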

##### TFIDF

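TFIDF (term frequency-inverse document frequency) measures how important a word is to a document. The core idea: a word that occurs frequently in the current document but rarely across all documents is important to that document.

$tfidf = TF \times IDF$

$TF = \frac{\text{number of occurrences of the word in the current document}}{\text{total number of words in the document}}$, i.e., the frequency of the word within the current document.

$IDF = \log\frac{\text{total number of documents}}{\text{number of documents containing the word}}$.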
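As a small worked example of the two formulas above, the sketch below scores the words of one document against a toy corpus. The corpus, the function name `tfidf_scores`, and the use of the plain `log(N / df)` form are assumptions made for illustration; they are not the notebook's implementation.

```python
import math
from collections import Counter

def tfidf_scores(doc, corpus):
    """Score each word in `doc` by TF * IDF (illustrative helper)."""
    counts = Counter(doc)
    total_words = len(doc)
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        tf = count / total_words
        # Number of documents containing the word; `doc` is assumed to be
        # part of `corpus`, so this is never zero.
        df = sum(1 for d in corpus if word in d)
        idf = math.log(n_docs / df)
        scores[word] = tf * idf
    return scores

# Toy corpus of already-segmented documents (hypothetical data).
corpus = [
    ["football", "player", "football", "coach"],
    ["baby", "doctor", "player"],
    ["doctor", "hospital", "baby"],
]
scores = tfidf_scores(corpus[0], corpus)
top_word = max(scores, key=scores.get)
print(top_word, scores)
```

Here "football" ranks first because it is frequent in the first document and absent from the others, while "player" is pulled down by appearing in two of the three documents.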

##### TextRank

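TextRank extracts keywords from the co-occurrence relations between words within a document.

$S(V_{i}) = (1-d) + d \sum_{V_{j} \in In(V_{i})} \frac{S(V_{j})}{|Out(V_{j})|}$

$d$ is the damping factor, which smooths the result. The summation says that the weight of word $i$ is accumulated from the weights of all words adjacent to it. The denominator $|Out(V_{j})|$ penalizes common but unimportant words such as "虽然" ("although") and "的" (a particle), which co-occur with many other words.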
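To make the update rule concrete, here is a minimal sketch on an unweighted, undirected co-occurrence graph, where $In(V_i)$ and $Out(V_i)$ are both just the neighbors of a word. The function name, window handling, iteration count, and toy input are assumptions for illustration; the notebook itself relies on a weighted graph adapted from jieba.

```python
from collections import defaultdict

def textrank(words, window=2, d=0.85, iterations=30):
    """Rank words by iterating the TextRank update on a co-occurrence graph."""
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        # Link each word to the words that follow it inside the window.
        for other in words[i + 1:i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        new_scores = {}
        for word in neighbors:
            # Each neighbor shares its score equally among its own neighbors.
            incoming = sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            new_scores[word] = (1 - d) + d * incoming
        scores = new_scores
    return scores

# Hypothetical segmented text: "football" co-occurs with every other word.
words = ["football", "player", "football", "coach", "football", "club"]
scores = textrank(words)
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy graph "football" is adjacent to all three other words, so it collects the largest score, while the three remaining words tie.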

##### Doc2vec

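Load a trained Word2vec model, then average the embeddings of the TFIDF keywords to obtain one vector per document. (Despite the section name, this is a mean of Word2vec word vectors rather than the Doc2vec/Paragraph Vector model.)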
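The following sketch illustrates the idea with plain Python lists: average the vectors of the keywords that have an embedding, then pick the most similar item by cosine similarity. The 2-dimensional vectors, item ids, and helper names are hypothetical stand-ins, not values from the trained model.

```python
import math

def average_embedding(words, vectors):
    """Average the vectors of the words that have an embedding; None if none do."""
    found = [vectors[w] for w in words if w in vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(vec[k] for vec in found) / len(found) for k in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar(query_id, item_vectors):
    """Return the id of the item whose vector is closest to the query item."""
    query = item_vectors[query_id]
    candidates = [i for i in item_vectors if i != query_id]
    return max(candidates, key=lambda i: cosine(query, item_vectors[i]))

# Toy word vectors (hypothetical values).
word_vectors = {
    "football": [1.0, 0.0],
    "player": [0.8, 0.2],
    "baby": [0.0, 1.0],
    "doctor": [0.2, 0.8],
}
item_keywords = {
    "news_a": ["football", "player"],
    "news_b": ["player", "unknown_word"],  # out-of-vocabulary words are skipped
    "news_c": ["baby", "doctor"],
}
item_vectors = {i: average_embedding(k, word_vectors) for i, k in item_keywords.items()}
nearest = most_similar("news_a", item_vectors)
print(nearest)
```

A brute-force scan like this is only practical for small catalogs; a larger one would typically use an approximate nearest-neighbor index instead.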

In [2]:
import os
import sys
import time
import codecs
import numpy as np
import pandas as pd

import jieba.analyse
import jieba.posseg as posseg
from operator import itemgetter
from collections import defaultdict

from gensim.models import Word2Vec
from gensim.models import KeyedVectors

from pandarallel import pandarallel
pandarallel.initialize()

import warnings
warnings.filterwarnings('ignore')
pd.options.display.max_columns = None

def timmer(func):
    """ Decorator that prints the run time of the wrapped function """
    def wrapper(*args, **kwargs):
        before_time = time.time()
        f = func(*args, **kwargs)
        print("--> RUN TIME: <%s> : %s" % (func.__name__, time.time() - before_time))
        return f
    return wrapper

INFO: Pandarallel will run on 6 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.


##### Load the data

In [3]:
df = pd.read_csv('../data/news/news.csv')
df = df.head(1000)
print('shape: ', df.shape)
df.head(2)

shape:  (1000, 11)


Unnamed: 0,timestamp,title,source,head_img,publish_time,url,category,keyword,tag,description,content
0,2021/4/29 20:02:38,对话张恩华：武磊的言行代表着中国足球 希望更多人留洋,tencent,https://inews.gtimg.com/newsapp_ls/0/905250541...,2021-04-29 18:36:10,http://new.qq.com/cmsn/20190522/20190522003669...,运动,"张恩华,武磊,国足,腾讯体育,博斯克,中超",张恩华;武磊;国足;腾讯体育;博斯克;中超,对话张恩华：武磊的言行代表着中国足球 希望更多人留洋,腾讯体育5月22日马德里讯（文/吴昊宇）距离万众瞩目的欧冠决战还有10天左右，马德里的万达大...
1,2021/6/5 05:02:23,为了生个优质宝宝，孕前、孕期检查很重要，这些项目必须查！,tencent,https://inews.gtimg.com/newsapp_ls/0/136089782...,2021-06-03 16:44:03,https://new.qq.com/omn/20180827/20180827A0TXHJ...,育儿,,孕期检查;宝宝;胎儿;人流;怀孕;顺产;畸形儿;婴儿,随着环境的恶劣、食品安全问题日益严重，早产儿、畸形儿等各种病态患儿越来越多，孕前检查显得尤为...,随着环境的恶劣、食品安全问题日益严重，早产儿、畸形儿等各种病态患儿越来越多，孕前检查显得尤为...


##### Word segmentation

Load the user dictionary and the stopword list, then segment the text.

In [4]:
def get_stopwords(stopwords_path):
    """ Load the stopwords (returned as a set for fast membership tests) """
    with codecs.open(stopwords_path, encoding='utf-8') as f:
        stopwords = {line.strip() for line in f}
    return stopwords

def cut_sentence(sentence, stopwords):
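    """ Filter the segmentation result: drop stopwords and single-character words, keep English tokens longer than 2 characters, and keep words tagged n*, x, v, j, s, or t """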
    import jieba.posseg as posseg
    seg_list = posseg.lcut(sentence)
    seg_list = [i for i in seg_list if i.word not in stopwords]
    filtered_words_list = []
    
    for seg in seg_list:
        if len(seg.word) <= 1:
            continue
        elif seg.flag == "eng":
            if len(seg.word) <= 2:
                continue
            else:
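                # Append a (word, flag) tuple, as the branch below does, so that
                # item[0] / item[1] downstream refer to the word and its POS tag
                filtered_words_list.append((seg.word, seg.flag))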
        elif seg.flag.startswith(
                "n") or seg.flag == "x" or seg.flag == "v" or seg.flag == "j" or seg.flag == "s" or seg.flag == "t":
            filtered_words_list.append((seg.word, seg.flag))

    return filtered_words_list

@timmer
def get_segments(df, stopwords):
    """ Segment every document and store the result in df['segments'] """
    df['segments'] = df['content'].parallel_apply(cut_sentence, args=(stopwords,))

In [6]:
# Repeat the title to give the title's keywords more weight
def get_content(row):
    return str(row.title)*3 + ' ' + str(row.description) + ' ' + str(row.content)
df['content'] = df.parallel_apply(get_content, axis=1)

# Load the user dictionary; set the stopword, IDF, and Word2vec model paths
abspath = "../data/news/"
user_dict_path = os.path.join(abspath, "dictionary.txt")
jieba.load_userdict(user_dict_path)
stopwords_path = os.path.join(abspath, "stopwords.txt")
idf_path = os.path.join(abspath, "idf.txt")
wv_model_path = os.path.join(abspath, "wv_50features_5mincount_5window")

# Segment (keep each word together with its POS tag)
stopwords = get_stopwords(stopwords_path)
get_segments(df, stopwords)

--> RUN TIME: <get_segments> : 23.699240684509277


##### TFIDF

In [7]:
@timmer
def get_tfidf(docs, df, idf_path):
    """ Extract TFIDF keywords for every document using the IDF file """
    tfidf = Tfidf(idf_path)
    tfidf_result = []
    for doc in docs:
        keywords = tfidf.extract_keywords(doc)
        tfidf_result.append(keywords)
    df['tfidf'] = tfidf_result
    
class Tfidf:
    def __init__(self, idf_file):
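        """
        Read the IDF value of each word and compute the average IDF,
        which is used as the default for words missing from the file
        Args:
            idf_file: path to the IDF file
        """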
        self._idf = {}
        self._idf_default = 0
        with open(idf_file, 'r', encoding='utf-8') as idf:
            for line in idf:
                word = line.strip().split()
                self._idf[word[0]] = float(word[1])
                self._idf_default += float(word[1])
        self._idf_default /= float(len(self._idf))

    def extract_keywords(self, items, top=10):
        """
        Extract keywords, ranked by tf * idf
        Args:
            items: a segmented document, as a list of (word, flag) pairs
            top: number of keywords to return
        """
        keywords = {}
        count = len(items)
        for item in items:
            word = item[0]
            if word not in keywords:
                keywords[word] = 0
            keywords[word] += 1

        for word in keywords:
            idf = self._idf_default
            if word in self._idf:
                idf = self._idf[word]
            keywords[word] = keywords[word] / count * idf

        return sorted(keywords.items(), key=lambda x: x[1], reverse=True)[:top]

######## Build the IDF file from the full corpus (optional) START ########
@timmer
def save_idf_file(docs, idf_path):
    """ Compute IDF values from the corpus and save them to a file """
    idf_dict = {}
    for doc in docs:
        # Count each word once per document, regardless of its POS tag
        for word in {item[0] for item in doc}:
            if word not in idf_dict:
                idf_dict[word] = 0.0
            idf_dict[word] += 1.0
    doc_count = len(docs)
    for word in idf_dict:
        # log1p gives log(1 + N/df), a smoothed variant of log(N/df)
        idf_dict[word] = np.log1p(doc_count / idf_dict[word])

    with open(idf_path, 'w', encoding='utf-8') as f:
        for word, idf in idf_dict.items():
            f.write(word + ' ' + str(idf) + '\n')
# Uncomment to rebuild the IDF file
#save_idf_file(df.segments.values, idf_path)
######## Build the IDF file from the full corpus END ########

get_tfidf(df.segments.values, df, idf_path)

--> RUN TIME: <get_tfidf> : 3.3137106895446777


##### TextRank

In [8]:
class TextRank(jieba.analyse.TextRank):
    def __init__(self, window=20, word_min_len=2):
        super(TextRank, self).__init__()  # initialize the jieba.analyse.TextRank parent class
        self.span = window  # co-occurrence window size
        self.word_min_len = word_min_len  # minimum word length
        self.pos_filter = frozenset(
            ('n', 'x', 'eng', 'f', 's', 't', 'nr', 'ns', 'nt', "nw", "nz", "PER", "LOC", "ORG")
        )
    
    def textrank(self, words_list, flags_list, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'), withFlag=False):
        """
        Extract keywords from sentence using TextRank algorithm.
        Parameter:
            - topK: return how many top keywords. `None` for all possible words.
            - withWeight: if True, return a list of (word, weight);
                          if False, return a list of words.
            - allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v'].
                        if the POS of w is not in this list, it will be filtered.
            - withFlag: if True, return a list of pair(word, weight) like posseg.cut
                        if False, return a list of words
        """
        class Wp:
            def __init__(self, word, flag):
                self.__word = word
                self.__flag = flag

            @property
            def word(self):
                return self.__word

            @word.setter
            def word(self, word):
                self.__word = word

            @property
            def flag(self):
                return self.__flag

            @flag.setter
            def flag(self, flag):
                self.__flag = flag

        class UndirectWeightedGraph:
            d = 0.85

            def __init__(self):
                self.graph = defaultdict(list)

            def addEdge(self, start, end, weight):
                # use a tuple (start, end, weight) instead of a Edge object
                self.graph[start].append((start, end, weight))
                self.graph[end].append((end, start, weight))

            def itervalues(self, d):
                return iter(d.values())
                
            def rank(self):
                ws = defaultdict(float)
                outSum = defaultdict(float)

                wsdef = 1.0 / (len(self.graph) or 1.0)
                for n, out in self.graph.items():
                    ws[n] = wsdef
                    outSum[n] = sum((e[2] for e in out), 0.0)

                # this line for build stable iteration
                sorted_keys = sorted(self.graph.keys())
                for x in range(10):  # 10 iters
                    for n in sorted_keys:
                        s = 0
                        for e in self.graph[n]:
                            s += e[2] / outSum[e[1]] * ws[e[1]]
                        ws[n] = (1 - self.d) + self.d * s

                (min_rank, max_rank) = (sys.float_info[0], sys.float_info[3])
                for w in self.itervalues(ws):
                    if w < min_rank:
                        min_rank = w
                    if w > max_rank:
                        max_rank = w

                for n, w in ws.items():
                    # to unify the weights, don't *100.
                    ws[n] = (w - min_rank / 10.0) / (max_rank - min_rank / 10.0)

                return ws

        self.pos_filt = frozenset(allowPOS)
        g = UndirectWeightedGraph()
        cm = defaultdict(int)

        wp_list = []
        for i in range(len(words_list)):
            wp = Wp(words_list[i], flags_list[i])
            wp_list.append(wp)

        words = tuple(wp_list)
        for i, wp in enumerate(words):
            if self.pairfilter(wp):
                for j in range(i + 1, i + self.span):
                    if j >= len(words):
                        break
                    if not self.pairfilter(words[j]):
                        continue
                    if allowPOS and withFlag:
                        cm[(wp, words[j])] += 1
                    else:
                        cm[(wp.word, words[j].word)] += 1
        for terms, w in cm.items():
            g.addEdge(terms[0], terms[1], w)
        nodes_rank = g.rank()
        if withWeight:
            sorted_tags = sorted(nodes_rank.items(), key=itemgetter(1), reverse=True)
        else:
            sorted_tags = sorted(nodes_rank, key=nodes_rank.__getitem__, reverse=True)
        if topK:
            return sorted_tags[:topK]
        else:
            return sorted_tags
    extract_tags = textrank

@timmer
def get_textrank(docs, df):
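    # TextRank with a co-occurrence window of 5 and a minimum word length of 2
    textrank_model = TextRank(window=5, word_min_len=2)
    # Allowed POS tags: nouns, unknown words or symbols, English, person names,
    # place names, organization names, work titles, other proper nouns, conjunctions
    allow_pos = ("n", "x", "eng", "nr", "ns", "nt", "nw", "nz", "c")  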
    
    textrank_result = []
    for doc in docs:
        word_list, flag_list = [], []
        for item in doc:
            word_list.append(item[0])
            flag_list.append(item[1])
        keywords = textrank_model.textrank(word_list, flag_list, topK=20, withWeight=True, allowPOS=allow_pos, withFlag=False)
        textrank_result.append(keywords)
    df['textrank'] = textrank_result
    
get_textrank(df.segments.values, df)

--> RUN TIME: <get_textrank> : 3.400704860687256


##### Doc2vec

In [9]:
class Embedding(object):
    """ Wrapper around a word2vec model """
    def __init__(self, size=128, window=5, 
                 min_count=5, workers=5, 
                 epochs=50, pretrained_model=None):
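        """
        Train word embedding vectors
        Args:
            size - vector dimension
            window - context window length
            min_count - minimum word frequency
            workers - number of parallel worker threads
            epochs - number of training epochs
            pretrained_model - path to a pretrained model
        """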
        self._model = None
        self._size = size
        self._window = window
        self._min_count = min_count
        self._workers = workers
        self._epochs = epochs
        if pretrained_model:
            self._model = Word2Vec.load(pretrained_model)
            
    @timmer
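    def train(self, sentences=None):
        if sentences is None:
            sentences = []
        if self._model:
            # Continue training a loaded model: add new words to the vocabulary first
            self._model.build_vocab(sentences, update=True)
            self._model.train(sentences,
                             total_examples = len(sentences),
                             epochs = self._epochs)
        else:
            self._model = Word2Vec(sentences,
                                   vector_size=self._size,
                                   window=self._window,
                                   min_count=self._min_count,
                                   workers=self._workers,
                                   epochs=self._epochs)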
    def save(self, model_path = None):
        self._model.save(model_path)
    
    @property
    def model(self):
        return self._model

In [10]:
# Uncomment to retrain the word2vec model
######## Train the word2vec model on the full corpus (optional) START ########
# sentences = [[item[0] for item in item_list] for item_list in df.segments.values]
# wv_emb = Embedding(size=50,
#                      window=5, 
#                      min_count=5, 
#                      workers=5,
#                      epochs=5, 
#                      pretrained_model=None)
# wv_emb.train(sentences)
# wv_emb.save(wv_model_path)
######## Train the word2vec model on the full corpus END ########

# Load the trained word2vec model
wv_emb = Embedding(size=50,
                     window=5, 
                     min_count=5, 
                     workers=5,
                     epochs=5, 
                     pretrained_model=wv_model_path)
wv_model = wv_emb.model

In [11]:
@timmer
def get_doc2vec(docs, df):
    """ Average the embeddings of each document's TFIDF keywords """
    sentences = [list(set([item[0] for item in doc])) for doc in docs]
    not_in_dict_set = set()
    avg_wv_arr = []
    dim = wv_model.wv.vector_size  # embedding dimension of the loaded model
    for sentence in sentences:
        wv_arr = np.zeros(dim)
        cnt = 0
        for word in sentence:
            try:
                wv_arr = wv_arr + wv_model.wv[word]
                cnt += 1
            except KeyError:
                # The word is missing from the word2vec vocabulary
                not_in_dict_set.add(word)
        if cnt == 0:
            avg_wv_arr.append(wv_arr)
        else:
            avg_wv_arr.append(wv_arr / cnt)

    print('not_in_dict_set cnt: {}'.format(len(not_in_dict_set)))
    df['doc2vec'] = avg_wv_arr

get_doc2vec(df.tfidf.values, df)

not_in_dict_set cnt: 290
--> RUN TIME: <get_doc2vec> : 0.04487752914428711


##### Words shared by TFIDF and TextRank as interest tags

In [12]:
def process_tags(row):
    """ Use the words shared by TFIDF and TextRank as interest tags """
    tags = list(set([item[0] for item in row.tfidf]) & set([item[0] for item in row.textrank]))
    return tags

@timmer
def get_tags(df):
    df['tags'] = df.parallel_apply(process_tags, axis=1)

get_tags(df)

--> RUN TIME: <get_tags> : 2.060326099395752
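
Because `process_tags` builds the tags from a set intersection, their order is arbitrary. If a stable, relevance-ordered list is preferred, one possible variant (a sketch with a hypothetical helper name and made-up weights, not part of the original pipeline) keeps the TFIDF ranking:

```python
def ordered_common_tags(tfidf_keywords, textrank_keywords):
    """Keep TFIDF keywords that TextRank also selected, in TFIDF rank order."""
    textrank_words = {word for word, _ in textrank_keywords}
    return [word for word, _ in tfidf_keywords if word in textrank_words]

# Hypothetical (word, weight) pairs in the same shape as the tfidf/textrank columns.
tfidf_keywords = [("fetus", 0.30), ("prenatal", 0.22), ("baby", 0.15)]
textrank_keywords = [("baby", 1.0), ("fetus", 0.9), ("doctor", 0.5)]
tags = ordered_common_tags(tfidf_keywords, textrank_keywords)
print(tags)
```

The result contains the same words as the set intersection; only the ordering differs.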


In [13]:
df.head(1)

Unnamed: 0,timestamp,title,source,head_img,publish_time,url,category,keyword,tag,description,content,segments,tfidf,textrank,doc2vec,tags
0,2021/4/29 20:02:38,对话张恩华：武磊的言行代表着中国足球 希望更多人留洋,tencent,https://inews.gtimg.com/newsapp_ls/0/905250541...,2021-04-29 18:36:10,http://new.qq.com/cmsn/20190522/20190522003669...,运动,"张恩华,武磊,国足,腾讯体育,博斯克,中超",张恩华;武磊;国足;腾讯体育;博斯克;中超,对话张恩华：武磊的言行代表着中国足球 希望更多人留洋,对话张恩华：武磊的言行代表着中国足球 希望更多人留洋对话张恩华：武磊的言行代表着中国足球 希...,"[(对话, n), (张恩华, nr), (武磊, nr), (言行, n), (代表, n...","[(张恩华, 0.3434719541875263), (足球, 0.18942164938...","[(足球, 1.0), (张恩华, 0.9718222586235503), (中国, 0....","[1.1075875863432885, 0.7062447622418404, -1.58...","[言行, 球员, 博斯克, 代表, 足球, 对话, 留洋, 武磊, 张恩华, 青训]"


In [14]:
def print_tag_info(df, index_list=[0,1]):
    """ Print tag information """ 
    for idx in index_list:
        row = df.iloc[idx]
        _tfidf = [i[0] for i in row.tfidf]
        _textrank =  [i[0] for i in row.textrank]
        _tags =  [i for i in row.tags]
        print(row.title)
        print(_tfidf, _textrank, _tags, '\n')

print_tag_info(df)

对话张恩华：武磊的言行代表着中国足球 希望更多人留洋
['张恩华', '足球', '青训', '留洋', '武磊', '言行', '博斯克', '代表', '对话', '球员'] ['足球', '张恩华', '中国', '青训', '代表', '留洋', '武磊', '球员', '对话', '言行', '腾讯', '感觉', '西甲', '深圳', '西班牙', '博斯克', '总监', '踢球', '东西', '直观'] ['言行', '球员', '博斯克', '代表', '足球', '对话', '留洋', '武磊', '张恩华', '青训'] 

为了生个优质宝宝，孕前、孕期检查很重要，这些项目必须查！
['胎儿', '孕前', '畸形儿', '孕期', '宝宝', '感染', '确宝', '弓形虫', '优质', '死胎'] ['胎儿', '宝宝', '优质', '建议', '畸形儿', '问题', '死胎', '医生', '顺产', '项目', '孩子', '医院', '激素', '糖尿病', '脊柱', '疾病', '代表', '厚度', '唐氏儿', '病毒'] ['优质', '宝宝', '死胎', '畸形儿', '胎儿'] 



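##### Directions for improvement:

1. Compile an industry-specific dictionary based on the industry category each article belongs to.

2. Try replacing jieba segmentation with a stronger named-entity recognition model.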