任务4：使用中文词向量完成mean/max/sif句子编码

- 步骤1 ：单词通过word2vec编码为100维向量，则句子编码为N∗100N∗100的矩阵，N为句子单词个数。
- 步骤2 ：将N*100的矩阵进行max-pooling编码，转为100维度。
- 步骤3 ：将N*100的矩阵进行mean-pooling编码，转为100维度。
- 步骤4 ：将N*100的矩阵与单词的IDF进行矩阵相乘，即按照单词的词频进行加权，进行tfidf-pooling编码，转为100维度。
- 步骤5 ：学习SIF编码的原理，进行sif编码，转为100维度。
> https://github.com/PrincetonML/SIF/blob/master/src/SIF_embedding.py#L30

> https://openreview.net/pdf?id=SyK00v5xx
- 步骤6（可选） ：通过上述步骤2-步骤5的编码，计算相似句子的相似度 vs 不相似句子的相似度， 绘制得到分布图，哪一种编码最优？

In [29]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#隐藏警告
import warnings
import jieba
warnings.filterwarnings('ignore')

In [27]:
def read_tsv(input_file,columns):
    with open(input_file,"r",encoding="utf-8") as file:
        lines = []
        count = 1
        for line in file:
            if len(line.strip().split("\t")) != 1:
                lines.append([count]+line.strip().split("\t"))
                count += 1
        df = pd.DataFrame(lines)
        df.columns = columns
    return df
bq_train = read_tsv('data/bq_corpus/train.tsv',['index','sentences1','sentences2','label'])
bq_train.head()

Unnamed: 0,index,sentences1,sentences2,label
0,1,用微信都6年，微信没有微粒贷功能,4。号码来微粒贷,0
1,2,微信消费算吗,还有多少钱没还,0
2,3,交易密码忘记了找回密码绑定的手机卡也掉了,怎么最近安全老是要改密码呢好麻烦,0
3,4,你好我昨天晚上申请的没有打电话给我今天之内一定会打吗？,什么时候可以到账,0
4,5,"“微粒贷开通""",你好，我的微粒贷怎么没有开通呢,0


In [28]:
#大写字母转为小写字母
def upper2lower(sentence):
    new_sentence=sentence.lower()
    return new_sentence
bq_train['chinese_sentences1'] = bq_train['sentences1'].apply(upper2lower)
bq_train['chinese_sentences2'] = bq_train['sentences2'].apply(upper2lower)

#去除文本中的表情字符（只保留中英文和数字）
import re
def clear_character(sentence):
    pattern1= '\[.*?\]'     
    pattern2 = re.compile('[^\u4e00-\u9fa5^a-z^A-Z^0-9]')   
    line1=re.sub(pattern1,'',sentence)
    line2=re.sub(pattern2,'',line1)   
    new_sentence=''.join(line2.split()) #去除空白
    return new_sentence
bq_train['chinese_sentences1'] = bq_train['chinese_sentences1'].apply(clear_character)
bq_train['chinese_sentences2'] = bq_train['chinese_sentences2'].apply(clear_character)

In [30]:
def segment_sen(sen):
    sen_list = []
    try:
        sen_list = jieba.lcut(sen)
    except:
        pass
    return sen_list
sen1_list = [segment_sen(i) for i in bq_train['chinese_sentences1']]
sen2_list = [segment_sen(i) for i in bq_train['chinese_sentences2']]

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Litra\AppData\Local\Temp\jieba.cache
Loading model cost 0.466 seconds.
Prefix dict has been built successfully.


In [10]:
#先读取w2v_model
from gensim.models import Word2Vec
model1 = Word2Vec.load('word2vec_1.model')
model2 = Word2Vec.load('word2vec_2.model')

In [31]:
from nltk.text import TextCollection

#构建sentences1和sentences2的语料库corpus1和corpus2
corpus1 = TextCollection(sen1_list)  
corpus2 = TextCollection(sen2_list)

In [48]:
#得到sif权重
def get_SIF_weight(a, sentences, corpus):
    SIF_weight = {}
    for sentence in sentences:
        for word in sentence:
            SIF_weight[word] =  a / a + corpus.tf_idf(word, sentence)
    return SIF_weight

In [49]:
a = 0.001
SIF_weight1 = get_SIF_weight(a, sen1_list, corpus1)
SIF_weight2 = get_SIF_weight(a, sen2_list, corpus2)

In [63]:
#获得基于SIF改进后的句子向量-->输入的是单个句子，返回的
import math
def build_sentences_vector_sif_weight(sentences,size,w2v_model,sif_weight):
    all_sentences_metrix = np.zeros((1, size))
    sen_vec=np.zeros(size).reshape((1,size))
    for index, sentence in enumerate(sentences):
        count=0
        for word in sentence:
            try:
                if word in sif_weight.keys():
                    sen_vec+=(np.dot(w2v_model.wv[word],math.exp(sif_weight[word]*0.001))).reshape((1,size))
                    count+=1
                else:
                    sen_vec+=w2v_model[word].reshape((1,size))
                    count+=1
            except KeyError:
                continue
        if count!=0:
            sen_vec/=count
        if index == 0:
            all_sentences_metrix = sen_vec
        else:
            all_sentences_metrix = np.vstack((all_sentences_metrix, sen_vec))
    return all_sentences_metrix

In [None]:
metrix_sen1 = build_sentences_vector_sif_weight(sen1_list, 100, model1, SIF_weight1)
metrix_sen2 = build_sentences_vector_sif_weight(sen2_list, 100, model2, SIF_weight2)

In [65]:
import numpy as np
from sklearn.decomposition import TruncatedSVD

def compute_pc(X,npc=1):
    """
    Compute the principal components. DO NOT MAKE THE DATA ZERO MEAN!
    :param X: X[i,:] is a data point
    :param npc: number of principal components to remove
    :return: component_[i,:] is the i-th pc
    """
    svd = TruncatedSVD(n_components=npc, n_iter=7, random_state=0)
    svd.fit(X)
    return svd.components_

def remove_pc(X, npc=1):
    """
    Remove the projection on the principal components
    :param X: X[i,:] is a data point
    :param npc: number of principal components to remove
    :return: XX[i, :] is the data point after removing its projection
    """
    pc = compute_pc(X, npc)
    if npc==1:
        XX = X - X.dot(pc.transpose()) * pc
    else:
        XX = X - X.dot(pc.transpose()).dot(pc)
    return XX


def SIF_embedding(sentences, size, w2v_model, sif_weight, npc):
    """
    Compute the scores between pairs of sentences using weighted average + removing the projection on the first principal component
    :param We: We[i,:] is the vector for word i
    :param x: x[i, :] are the indices of the words in the i-th sentence
    :param w: w[i, :] are the weights for the words in the i-th sentence
    :param params.rmpc: if >0, remove the projections of the sentence embeddings to their first principal component
    :return: emb, emb[i, :] is the embedding for sentence i
    """
    emb = build_sentences_vector_sif_weight(sentences,size,w2v_model,sif_weight)
    if  npc > 0:
        emb = remove_pc(emb, npc)
    return emb

In [66]:
sif_embedding_1= SIF_embedding(sen1_list, 100, model1, SIF_weight1, 1)
sif_embedding_2= SIF_embedding(sen2_list, 100, model2, SIF_weight2, 1)

In [68]:
sif_embedding_1

array([[ 0.1631529 ,  0.68570732, -0.56345899, ...,  0.05019303,
         0.08823364,  0.73780842],
       [ 0.1631529 ,  0.68570732, -0.56345899, ...,  0.05019303,
         0.08823364,  0.73780842],
       [-0.45038705,  0.14196152, -0.29118772, ..., -0.94526421,
         0.30650289,  0.00765647],
       ...,
       [-0.17717167, -0.1151685 ,  0.64575971, ...,  0.43572011,
        -0.2061713 , -0.20033421],
       [-0.24399607, -0.0214033 ,  0.93929948, ...,  0.97110859,
        -0.61965028, -0.61955608],
       [ 0.11829696, -0.06113644, -0.26674104, ..., -0.30710764,
         0.16443198, -0.07984093]])

In [71]:
np.savetxt('sif_embedding1.txt', sif_embedding_1)
np.savetxt('sif_embedding2.txt', sif_embedding_2)