Python 3.7

# Converting Words to Embeddings with Gensim's Word2Vec

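- Goal: convert vocabulary into embeddings and look up similar words
- Input: a text corpus and a query word
- Output: the word vector and its nearest words (model.wv['word'] / model.wv.similar_by_word("word", topn=10))
- Main steps:
1. Text preprocessing
2. Model training
3. Queries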

## 1. Text preprocessing

In [None]:
import re
import jieba

def preprocess(file_path):
    """Read a UTF-8 text file and return a list of tokenized clauses."""
    clauses = []
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            # Keep only Chinese characters and basic punctuation
            line = re.sub("[^\u4e00-\u9fa5。？．，！：]", "", line.strip())
            # Split the line into clauses at the basic punctuation marks
            clauses += re.split("[。？．，！：]", line)
    # Drop empty strings, then tokenize each clause with jieba
    return [jieba.lcut(clause) for clause in clauses if clause]
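
To make the cleaning step concrete, here is a minimal sketch of what the two regular expressions do to a single line, leaving out the jieba tokenization. The helper name `clean_and_split` and the sample sentence are made up for illustration and are not part of the original notebook.

```python
import re

# Same character classes as in preprocess(): Chinese characters plus basic punctuation
KEEP_PATTERN = "[^\u4e00-\u9fa5。？．，！：]"
SPLIT_PATTERN = "[。？．，！：]"

def clean_and_split(line):
    """Strip everything except Chinese text and punctuation, then split into clauses."""
    cleaned = re.sub(KEEP_PATTERN, "", line.strip())
    # re.split leaves empty strings around adjacent or trailing punctuation; drop them
    return [clause for clause in re.split(SPLIT_PATTERN, cleaned) if clause]

# Hypothetical sample line mixing Chinese, Latin letters and digits
print(clean_and_split("今天天气很好，我们去公园！Hello 123。"))
# → ['今天天气很好', '我们去公园']
```

Note that Latin letters and digits are removed entirely, so English words in the corpus never reach the tokenizer.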

In [None]:
text = preprocess('/text_files/test.txt')

## 2. Model training

In [None]:
from gensim.models import word2vec

# vector_size is the gensim 4.x parameter name (gensim 3.x called it size)
model = word2vec.Word2Vec(text, min_count=2, window=5, vector_size=100)
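
The `min_count=2` setting drops every word whose total frequency in the corpus is below 2, which decides which words can be queried later. The following is a rough sketch of that filtering using only the standard library; it illustrates the rule and is not how gensim implements it internally. The helper name `surviving_vocab` and the sample sentences are hypothetical.

```python
from collections import Counter
from itertools import chain

def surviving_vocab(sentences, min_count=2):
    """Return the set of tokens that occur at least min_count times overall."""
    counts = Counter(chain.from_iterable(sentences))
    return {word for word, count in counts.items() if count >= min_count}

# Hypothetical tokenized corpus in the same shape preprocess() returns
sample = [["我", "喜欢", "猫"], ["我", "喜欢", "狗"]]
print(surviving_vocab(sample, min_count=2))
```

With this sample only "我" and "喜欢" survive; "猫" and "狗" each occur once and would have no vector in the trained model.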

## 3. Queries

In [None]:
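# Vectors live on model.wv; indexing the model directly is not supported in gensim 4.x
# Note: preprocess() keeps only Chinese characters, so "Jack" would raise KeyError here;
# substitute a word that occurs in the corpus at least min_count times
print(model.wv["Jack"])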

In [None]:
print(model.wv.similar_by_word("Jack", topn=10))
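
Both queries raise `KeyError` when the word is not in the trained vocabulary. One possible guard, sketched here under the assumption of gensim 4.x (where `KeyedVectors` exposes a `key_to_index` mapping), is to check membership first. The helper name `similar_or_none` is hypothetical.

```python
def similar_or_none(wv, word, topn=10):
    """Return the nearest neighbours of word, or None if it is out of vocabulary.

    wv is expected to be a gensim 4.x KeyedVectors object such as model.wv.
    """
    if word not in wv.key_to_index:
        return None
    return wv.similar_by_word(word, topn=topn)

# Intended usage with the model trained above:
#     print(similar_or_none(model.wv, "Jack", topn=10))
```

Returning `None` instead of raising keeps a batch of lookups running when some query words were filtered out by preprocessing or by `min_count`.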

