# Phrase Similarity


Adapted from: [Phrase Embedding Method Introduction](https://towardsdatascience.com/sentence-embedding-3053db22ea77)

### Represent Phrase Embeddings by **Word Centroid**: 

    Sum the vectors of all words in the phrase and divide by the word count. The resulting centroid is used as the phrase vector.
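
As a minimal sketch of the centroid idea, assuming Python 3 and made-up 2-d character vectors (`toy_vectors` and `centroid` are illustrative names, not part of the notebook), the snippet below averages the vectors and shows that two phrases built from the same characters get the same centroid regardless of order:

```python
# Toy 2-d character vectors; the values are invented for illustration only.
toy_vectors = {
    "好": [1.0, 2.0],
    "吃": [0.5, -1.0],
    "不": [-2.0, 0.5],
    "贵": [3.0, 1.5],
}

def centroid(phrase, vectors):
    """Average the vectors of all characters in `phrase`."""
    dims = len(next(iter(vectors.values())))
    total = [0.0] * dims
    for ch in phrase:
        for i, value in enumerate(vectors[ch]):
            total[i] += value
    return [value / len(phrase) for value in total]

a = centroid("好吃不贵", toy_vectors)
b = centroid("好贵不吃", toy_vectors)
print(a)       # [0.625, 0.75]
print(a == b)  # True: same characters in a different order give the same centroid
```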
    
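Problems:
1. Opposite meanings built from the same words. **(Word order is ignored)**  
    `similarity('好吃不贵','好贵不吃','cosine')` = 2.5125567e-07 ("tasty and not expensive" vs. "too expensive, not eating")  
    
2. Opposite meanings that differ by a single negation word. **(Negation words)**  
    `similarity('好吃','不好吃','cosine')` = 2.1768517 ("tasty" vs. "not tasty")  
    `similarity('好吃','难吃','cosine')` = 3.7945359 ("tasty" vs. "tastes bad")
        
3. Same meaning expressed with different words. **(Semantic understanding)**  
    `similarity('老板热情','服务周到','cosine')` = 4.8603044 ("the owner is friendly" vs. "attentive service")
    
The score is the distance between the two phrase centroids, so a smaller value means more similar phrases. Despite the `'cosine'` label, the `cos_dis` function in the code section returns the Euclidean (L2) distance, which is why the values can exceed 2.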
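
For comparison, the following sketch contrasts the Euclidean distance (what `np.linalg.norm(v1-v2)` in `cos_dis` computes) with a true cosine distance. It is an illustration under assumptions: the helper names and the two toy vectors are made up, and it uses only the standard library.

```python
import math

def euclidean_distance(u, v):
    """L2 norm of the difference, equivalent to np.linalg.norm(u - v)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def cosine_distance(u, v):
    """1 - cosine similarity; ranges from 0 (same direction) to 2 (opposite)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)

u = [3.0, 4.0]
v = [6.0, 8.0]  # same direction as u, twice the length

print(euclidean_distance(u, v))  # 5.0
print(cosine_distance(u, v))     # 0.0
```

Euclidean distance is sensitive to vector length, while cosine distance only looks at direction, so the two can rank candidates differently.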

### Represent Phrase Embeddings by **Word Mover's Distance**:  
    
    For every non-stop word in phrase 1, find the minimum travel distance to a non-stop word in phrase 2.
    Sum these minimum distances to get the distance between the two phrases.
    
<div align=center>   
<img src="https://cdn-images-1.medium.com/max/1600/1*JICAZM0gRjD9kSyMZtm99Q.png" width="40%" height="40%" />
</div>
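
The nearest-word matching described above can be sketched as follows. This is a simplified illustration, not the full Word Mover's Distance (which solves an optimal transport problem); the function names, the English toy words, and their 2-d vectors are all assumptions made for the example.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def nearest_word_distance(words_1, words_2, vectors):
    """Sum, over the words of phrase 1, of the distance to the closest word of phrase 2."""
    return sum(
        min(euclidean(vectors[w1], vectors[w2]) for w2 in words_2)
        for w1 in words_1
    )

# Made-up vectors: "boss"/"owner" are close, and so are "friendly"/"warm".
toy_vectors = {
    "boss": [0.0, 0.0], "owner": [0.0, 1.0],
    "friendly": [4.0, 0.0], "warm": [4.0, 1.0],
}

score = nearest_word_distance(["boss", "friendly"], ["owner", "warm"], toy_vectors)
print(score)  # 2.0
```

Because each word is matched to its nearest counterpart, two phrases that share no words can still end up close when their words are semantically related.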


### Stacking  
    
1. First pass: select the top 20 most similar tags using the word centroid.

    This step compares tags character by character, so it eliminates irrelevant tags. However, some of the remaining candidates contain exactly the same characters as the query tag but carry a different meaning.

2. Second pass: use WMD to refine the results.

    WMD works on segmented words rather than single characters, so it can pick out the more relevant tags from the preliminary results.
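
The two-pass idea can be sketched generically as "shortlist with a cheap distance, then rerank with a finer one". The snippet below is a simplification under assumptions: `two_stage_rank` and both toy distance functions are hypothetical, and it reranks by the fine distance alone, whereas the notebook's second pass ranks by a weighted combination of both scores.

```python
def two_stage_rank(query, candidates, coarse_distance, fine_distance,
                   shortlist_size=20, final_size=5):
    """Shortlist with a cheap distance, then rerank the shortlist with a finer one."""
    others = [c for c in candidates if c != query]
    shortlist = sorted(others, key=lambda c: coarse_distance(query, c))[:shortlist_size]
    return sorted(shortlist, key=lambda c: fine_distance(query, c))[:final_size]

# Toy distances: the coarse one ignores order (like a centroid),
# the fine one compares position by position.
def bag_distance(a, b):
    return len(set(a) ^ set(b))

def position_distance(a, b):
    return sum(x != y for x, y in zip(a, b))

ranked = two_stage_rank("abc", ["abc", "acb", "abd", "xyz", "abx"],
                        bag_distance, position_distance,
                        shortlist_size=3, final_size=2)
print(ranked)  # ['abd', 'abx']
```

In this toy run, `"acb"` is the top candidate of the first pass because it has the same letters as the query, but the order-aware second pass ranks it last.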
    

# Result

**Result of Word Centroid**

In [85]:
num = 987
display_index = range(num,num+8)
pd.DataFrame(wc_res.iloc[display_index,:]).T

| | 987 | 988 | 989 | 990 | 991 | 992 | 993 | 994 |
|---|---|---|---|---|---|---|---|---|
| tags | 重庆必吃 | 游戏活动 | 英式下午茶 | 精美茶点 | 针灸瘦身项目 | 古典精致 | 浦江夜色 | 古典风格 |
| 1 | 婚庆公司商家特色 | 冬天游玩 | 德式酒吧 | 有美腰机 | 身体紧致项目 | 古典风格 | 闽南特色 | 民国风格摄影 |
| 2 | 创战纪主题失重餐厅 | 广告活动 | 粤式早茶 | 酒店下午茶 | 半永久纹眼线项目 | 古典音乐演奏 | 江西五十铃 | 欧式风格 |
| 3 | 不满意重拍 | 活动标签 | 萌宠下午茶 | 餐饮点心评价 | 水氧活肤项目 | 精致婚纱礼服 | 江南风情 | 田园风格 |
| 4 | 土豪必去 | 温泉活动 | 网红下午茶 | 招牌千岛鱼头鲜嫩多汁 | 肉毒素除皱项目 | 古典装修 | 江西菜 | 古典乐酒吧 |
| 5 | 重庆菜 | 春天游玩 | 港式茶点 | 摆盘精美 | 祛黑眼圈项目 | 精致早餐 | 湛江菜 | 园林风格 |
| 6 | 重庆火锅 | 游戏礼包 | 书店下午茶 | 菜品精致 | 肉毒素缩鼻头项目 | 古典/复古家具 | 23层外滩江景 | 民谣风格 |
| 7 | 天津必吃 | 棋牌游戏 | 英式酒吧 | 书店下午茶 | 瘦脸项目 | 古典乐酒吧 | 吉林江北 | 购物风格 |
| 8 | 旅行必吃 | 网吧优惠活动 | 下午茶套餐 | 舌尖中国2出镜美食 | 黄金微针项目 | 精致风格摄影 | 夜色赞 | 小清新风格纹身 |
| 9 | 婚庆公司擅长风格 | 秋天游玩 | 高颜值下午茶 | 用餐地点神秘 | 身体净肤项目 | 古典舞 | 江景下午茶 | 古典/复古家具 |


**Result of Linear Combination of Word Centroid and Word Mover's Distance**

In [87]:
pd.DataFrame(wmd_res.iloc[display_index,:]).T

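| | 987 | 988 | 989 | 990 | 991 | 992 | 993 | 994 |
|---|---|---|---|---|---|---|---|---|
| tags | 重庆必吃 | 游戏活动 | 英式下午茶 | 精美茶点 | 针灸瘦身项目 | 古典精致 | 浦江夜色 | 古典风格 |
| 1 | 成都必吃 | 电玩游戏 | 日式下午茶 | 港式茶点 | 按摩瘦身项目 | 古朴精致 | 外滩夜景 | 古典装修 |
| 2 | 武汉必吃 | 吃鸡游戏 | 网红下午茶 | 摆盘精美 | 仪器瘦身项目 | 精致婚礼 | 夜色赞 | 古典/复古家具 |
| 3 | 上海必吃 | 游戏特权 | 中式下午茶 | 精致点心 | 肉毒素瘦脸项目 | 精致风格摄影 | 23层外滩江景 | 古典乐酒吧 |
| 4 | 西安必吃 | 网吧活动 | 英式酒吧 | 精致点心 | 瘦脸针项目 | 古典/复古家具 | 江南人家 | 复古风格摄影 |
| 5 | 天津必吃 | 游戏开发 | 下午茶特色 | 精致本帮菜馆 | 微针美肤项目 | 精致拉花 | 江南风情 | 复古风格瓷砖 |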


# Code

In [1]:
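#coding=utf-8
# Note: this notebook targets Python 2 (print statements, byte strings with .decode)
# and gensim < 4.0 (KeyedVectors.vocab).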
from gensim.models import KeyedVectors
import pandas as pd
import numpy as np
import jieba
import jieba.posseg as pseg

## Embedding

We use [pre-trained embeddings](https://github.com/Embedding/Chinese-Word-Vectors) that were trained with Word2Vec on a mix of several corpora.

In [2]:
word_vectors = KeyedVectors.load_word2vec_format('sgns.merge.bigram.bz2') 

## Import tags

Load both tag files, merge them, and remove duplicate tags.

In [4]:
def read_tags(mt_fp,dp_fp):
    # Load both tag files, dropping every column that contains missing values
    mt = pd.read_csv(mt_fp).dropna(axis=1)
    dp = pd.read_csv(dp_fp).dropna(axis=1)
    # Align the column names so that the two frames can be stacked
    dp = dp.rename(columns={'tagname':'name'})
    all_tags = pd.concat([mt,dp], axis=0)
    all_tags = all_tags.drop_duplicates()
    print 'tag_counts:',mt.shape[0], dp.shape[0], all_tags.shape[0]
    return all_tags

In [5]:
mt_fp = 'poi_tag/mt_tags.csv'
dp_fp = 'poi_tag/dp_tags.csv'
all_tags = read_tags(mt_fp,dp_fp)

tag_counts: 454 3880 4210


## Clean tags

In [17]:
# Remove non-breaking spaces (the UTF-8 bytes "\xc2\xa0") and ordinary spaces from every tag.
# Stop-word cleaning can be added here.
def clean_text(all_tags):
    clean_tags = []
    for tag in all_tags:
        tag = tag.replace("\xc2\xa0", "")
        tag = tag.replace(" ", "")
        clean_tags.append(tag)
    return clean_tags

In [18]:
tags = clean_text(all_tags['name'].tolist())
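
`clean_text` above matches the non-breaking space as the two UTF-8 bytes `"\xc2\xa0"`, which only works on Python 2 byte strings. As a hedged sketch of the same cleaning on Python 3 `str` values (the name `clean_text_unicode` is hypothetical), the non-breaking space is matched as the single code point `"\u00a0"` instead:

```python
def clean_text_unicode(tags):
    """Remove non-breaking spaces (U+00A0) and ordinary spaces from each tag."""
    return [tag.replace("\u00a0", "").replace(" ", "") for tag in tags]

cleaned = clean_text_unicode(["重庆\u00a0必吃", " 下午茶 "])
print(cleaned)  # ['重庆必吃', '下午茶']
```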

# Word Centroid 

## Generating Similarity Matrix

In [21]:
def generate_similarity_matrix(all_tags):
    # Map every tag to the list of its distances to all tags (itself included)
    matrix = dict()
    tags = clean_text(all_tags['name'].tolist())
    for tag in tags:
        matrix[tag.decode('utf-8')] = map(lambda x: similarity(tag,x,'cosine'),tags)
    return matrix

In [22]:
def cos_dis(v1, v2):
    # NOTE: despite the name, this is the Euclidean (L2) distance, not the cosine distance
    return np.linalg.norm(v1-v2)

def similarity(s1,s2,dis_method):
    # Decode to unicode so that iteration yields characters rather than bytes
    s1_chars = s1.decode('utf-8')
    s2_chars = s2.decode('utf-8')
    # Phrase vector = centroid (mean) of the character vectors
    s1_vec = sum(word_vectors[ch] for ch in s1_chars)/len(s1_chars)
    s2_vec = sum(word_vectors[ch] for ch in s2_chars)/len(s2_chars)
    if dis_method == 'cosine':
        return cos_dis(s1_vec,s2_vec)
    raise ValueError('unsupported dis_method: %s' % dis_method)

In [23]:
matrix = generate_similarity_matrix(all_tags)
print len(tags),len(matrix)

4210 4210
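
`generate_similarity_matrix` recomputes both centroids for every pair of tags. One possible alternative, sketched here under the assumption that the centroids have already been stacked into an `(n, d)` NumPy array (the function name `pairwise_euclidean` and the toy data are made up), computes all pairwise Euclidean distances at once via the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b:

```python
import numpy as np

def pairwise_euclidean(centroids):
    """Return the (n, n) matrix of Euclidean distances between the rows of `centroids`."""
    squared_norms = (centroids ** 2).sum(axis=1)
    squared_dist = (squared_norms[:, None] + squared_norms[None, :]
                    - 2.0 * centroids.dot(centroids.T))
    # Rounding can produce tiny negative values; clip them before the square root.
    return np.sqrt(np.maximum(squared_dist, 0.0))

centroids = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
distances = pairwise_euclidean(centroids)
print(distances)
```

This form only allocates `(n, n)` arrays, which matters for a few thousand tags with high-dimensional vectors.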


## Select Top 20 Similar Tags

In [24]:
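def qselect(A,k):
    # Quickselect: return the k smallest elements of A (in no particular order)
    if k<=0: return []
    if len(A)<k:return A
    pivot = A[-1]
    # right holds the pivot plus everything strictly smaller than it
    right = [pivot] + [x for x in A[:-1] if x<pivot]
    rlen = len(right)
    if rlen==k:
        return right
    if rlen>k:
        return qselect(right, k)
    else:
        # Not enough small elements yet: take the remainder from the larger ones
        left = [x for x in A[:-1] if x>=pivot]
        return qselect(left, k-rlen) + right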
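
`qselect` returns score values, so the notebook later recovers each tag with `list.index(score)`, which always returns the first match when two tags have the same score. A possible alternative, shown here only as a sketch with a hypothetical helper name, is to select indices directly using the standard-library `heapq.nsmallest`:

```python
import heapq

def k_smallest_indices(scores, k, skip_index=None):
    """Return the indices of the k smallest scores, optionally skipping one index."""
    indexed = ((score, i) for i, score in enumerate(scores) if i != skip_index)
    return [i for _, i in heapq.nsmallest(k, indexed)]

scores = [0.0, 2.5, 1.0, 2.5, 0.7]
top = k_smallest_indices(scores, 3, skip_index=0)
print(top)  # [4, 2, 1]
```

Ties are broken by index, so the two entries with score 2.5 stay distinguishable, and `skip_index` can exclude the query tag itself instead of testing for a score of exactly 0.0.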

In [45]:
# Input:
#     matrix: similarity matrix (tag -> list of distances to every tag)
#     res_num: number of most similar tags to keep per tag
# Output:
#     [result, wc_score]: DataFrame of candidate tags and dict of their distances

def wc_similar_tags(matrix, res_num=20):
    result = pd.DataFrame()
    wc_score = dict()
    for stag in tags:
        # Take res_num+1 smallest distances because the tag itself (distance 0) is among them
        arr = qselect(matrix[stag.decode('utf-8')], res_num+1)
        candidate = [stag]
        candidate_score = []
        for score in arr:
            if(score==0.0): continue  # skip the tag itself
            # list.index returns the first match, so tied scores map to the same tag
            candidate.append(tags[matrix[stag.decode('utf-8')].index(score)])
            candidate_score.append(score)
        result = result.append(pd.DataFrame(candidate).T)
        wc_score[stag] = candidate_score

    columns_name = ['tags'] + [str(i) for i in range(1, res_num+1)]
    result.columns = columns_name

    result = result.reset_index()
    result = result.drop('index',axis=1)
    return [result,wc_score]

In [46]:
[wc_res,wc_score] = wc_similar_tags(matrix, 20)

In [47]:
wc_res.to_csv('similar_tags_wc.csv')

# Word Mover's Distance

In [48]:
# Input: p1, p2 (two phrases)
# Output: sum, over the words of p1, of the distance to the nearest word of p2.
# This is a one-directional simplification of WMD: it does not solve the full
# transport problem, and no stop-word filtering is applied here.

def WMD_core(p1,p2):
    def to_vectors(p):
        # If a segmented word has no embedding, fall back to the embeddings of its characters
        vects = []
        for word in jieba.cut(p):
            if word in word_vectors.vocab:
                vects.append(word_vectors[word])
            else:
                vects.extend(map(lambda x: word_vectors[x], word))
        return vects

    p1_vect = to_vectors(p1)
    p2_vect = to_vectors(p2)

    sum_min = 0.0
    for w1 in p1_vect:
        # Distance from w1 to its nearest word in p2 (Euclidean, see cos_dis)
        cur_min = float('inf')
        for w2 in p2_vect:
            temp = cos_dis(w1,w2)
            if temp < cur_min:
                cur_min = temp
        sum_min += cur_min
    return sum_min

In [49]:
# WMD Wrapper
# input p1, p2_list
def WMD(p1,p2_list):
    sum_res = []
    for p2 in p2_list:
        if not pd.isnull(p2):
            sum_res.append(WMD_core(p1,p2))
    return sum_res

In [57]:
# Input:
#     tags: list of cleaned tags
#     wc_result: candidate DataFrame returned by wc_similar_tags
#     wc_score: dict of word-centroid distances returned by wc_similar_tags
#     res_num: number of results to keep per tag
#     wmd_weight: weight of the WMD score in the linear combination
def wmd_similar_tags(tags, wc_result, wc_score, res_num=5, wmd_weight=0.9):

    wmd_res = pd.DataFrame()
    for col in range(len(tags)):
        t_name = wc_result.iloc[col,:].tolist()
        # The leading None keeps the score lists aligned with t_name,
        # whose first entry is the query tag itself
        td_score = [None]
        wmd_score = [None]
        p1 = wc_result.iloc[col,0]
        p2_list = wc_result.iloc[col,1:].tolist()
        td_score.extend(wc_score[t_name[0]])
        wmd_score.extend(WMD(p1,p2_list))

        # Linearly combine the two scores
        total_score = pd.DataFrame(td_score)*(1-wmd_weight) + pd.DataFrame(wmd_score)*wmd_weight
        score_list = total_score[0].tolist()

        # Select the res_num candidates with the lowest combined distance
        arr = qselect(score_list,res_num)
        candidate = [t_name[0]]
        sub_candidate = []
        for score in arr:
            if(score==0.0): continue
            sub_candidate.append(t_name[score_list.index(score)])
        candidate.extend(sub_candidate[::-1])
        wmd_res = wmd_res.append(pd.DataFrame(candidate).T)

    columns_name = ['tags'] + [str(i) for i in range(1, res_num+1)]
    wmd_res.columns = columns_name
    wmd_res = wmd_res.reset_index()
    wmd_res = wmd_res.drop('index',axis=1)
    return wmd_res

In [58]:
wmd_res = wmd_similar_tags(tags, wc_res, wc_score, 5, 0.8)
wmd_res.to_csv('similar_tags_wmd.csv')
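
The linear combination used in `wmd_similar_tags` can be isolated in a few lines. The sketch below is an illustration only: `combine_and_rank`, the placeholder tag names, and the scores are invented, and unlike the notebook it returns the candidates sorted by ascending combined distance.

```python
def combine_and_rank(names, centroid_scores, wmd_scores, wmd_weight=0.8, top_n=5):
    """Rank candidates by (1 - w) * centroid distance + w * WMD distance."""
    combined = [
        (1.0 - wmd_weight) * c + wmd_weight * w
        for c, w in zip(centroid_scores, wmd_scores)
    ]
    order = sorted(range(len(names)), key=lambda i: combined[i])
    return [names[i] for i in order[:top_n]]

names = ["tag_a", "tag_b", "tag_c"]
best = combine_and_rank(names, [1.0, 2.0, 3.0], [3.0, 1.0, 2.0],
                        wmd_weight=0.5, top_n=2)
print(best)  # ['tag_b', 'tag_a']
```

A larger `wmd_weight` lets the WMD score dominate the ranking. Since the two distances may live on different scales, the weight may need tuning for a given embedding.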