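# Bag-of-Words Model
## The first step in an NLP task is to split the text into sentences and tokenize it
### Computing sentence similarity with the bag-of-words model
- 1. Split the text into sentences and tokenize it
- 2. Build the vocabulary (corpus)
- 3. Map the vocabulary (corpus) to integer IDs
- 4. Build a vector representation of each sentence
- 5. Compute the similarity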
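
The five steps can be sketched compactly with the standard library alone. This is a minimal sketch under stated assumptions, not the notebook's own implementation: the `tokenize` helper is a hypothetical regex stand-in for a real tokenizer, and `collections.Counter` takes the place of the explicit word-to-ID mapping, since a missing key counts as zero.

```python
import re
from collections import Counter
from math import sqrt

def tokenize(sentence):
    # Hypothetical stand-in for a real tokenizer: runs of word characters
    # become tokens, and each punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", sentence)

def bag_of_words(tokens):
    # Map each distinct token to its count; absent tokens are implicitly 0.
    return Counter(tokens)

def cosine_similarity(bow_a, bow_b):
    # Only tokens present in both bags contribute to the dot product.
    dot = sum(count * bow_b[token] for token, count in bow_a.items())
    norm_a = sqrt(sum(c * c for c in bow_a.values()))
    norm_b = sqrt(sum(c * c for c in bow_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty sentences
    return dot / (norm_a * norm_b)

bow1 = bag_of_words(tokenize("I love sky, I love sea."))
bow2 = bag_of_words(tokenize("I like running, I love reading."))
print(f"{cosine_similarity(bow1, bow2):.4f}")  # 0.7303
```

Keying the counts by token avoids the ID-mapping step entirely, at the cost of no longer having a fixed-length vector that could be passed to other numeric code.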

In [3]:
sent1 = "I love sky, I love sea."
sent2 = "I like running, I love reading."

## Tokenization
- English tokenization
    - NLTK
- Chinese tokenization
    - jieba

### 1. Tokenization

In [4]:
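# word_tokenize needs NLTK's Punkt tokenizer data to be downloaded beforehand
from nltk import word_tokenize
sents = [sent1, sent2]
# word_tokenize already returns a list of tokens for each sentence
texts = [word_tokenize(sent) for sent in sents]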

In [5]:
texts

[['I', 'love', 'sky', ',', 'I', 'love', 'sea', '.'],
 ['I', 'like', 'running', ',', 'I', 'love', 'reading', '.']]

### 2. Build the corpus

In [6]:
# Collect every distinct token across all sentences
corpus = {word for text in texts for word in text}

In [7]:
corpus

{',', '.', 'I', 'like', 'love', 'reading', 'running', 'sea', 'sky'}

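### 3. Map the corpus to integer IDs

In [8]:
# Set iteration order is arbitrary, so the assigned IDs can differ between runs
corpus_dict = dict(zip(corpus, range(len(corpus))))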

In [9]:
corpus_dict

{'love': 0,
 'I': 1,
 ',': 2,
 '.': 3,
 'running': 4,
 'sky': 5,
 'sea': 6,
 'like': 7,
 'reading': 8}
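
Because the mapping above is built from a `set`, the ID assigned to each word is not guaranteed to be the same from one run to the next. If reproducible IDs matter, one option is to sort the vocabulary before numbering it. The sketch below assumes the token lists printed in step 1; the names `vocabulary` and `word_to_id` are illustrative and do not appear in the notebook.

```python
# Token lists copied from the tokenization output in step 1.
texts = [
    ['I', 'love', 'sky', ',', 'I', 'love', 'sea', '.'],
    ['I', 'like', 'running', ',', 'I', 'love', 'reading', '.'],
]

# Sorting the distinct tokens fixes their order, so the IDs are stable.
vocabulary = sorted({word for text in texts for word in text})
word_to_id = {word: idx for idx, word in enumerate(vocabulary)}
print(word_to_id)
# {',': 0, '.': 1, 'I': 2, 'like': 3, 'love': 4, 'reading': 5, 'running': 6, 'sea': 7, 'sky': 8}
```

The similarity score does not depend on which ID a word receives, only on both sentences using the same mapping, so this changes the printed vectors but not the final result.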

### 4. Build the sentence vectors

In [10]:
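def vector_rep(text, corpus_dict):
    # Pair each word ID with the number of times the word occurs in the sentence;
    # list.count returns 0 for words that do not occur
    vec = [(idx, text.count(word)) for word, idx in corpus_dict.items()]

    # Sort by word ID so that every sentence vector uses the same dimension order
    return sorted(vec, key=lambda x: x[0])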

vec1 = vector_rep(texts[0], corpus_dict)
vec2 = vector_rep(texts[1], corpus_dict)
print(vec1)
print(vec2)

[(0, 2), (1, 2), (2, 1), (3, 1), (4, 0), (5, 1), (6, 1), (7, 0), (8, 0)]
[(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 0), (6, 0), (7, 1), (8, 1)]


### 5. Compute sentence similarity

In [11]:
from math import sqrt
def similarity_with_2_sents(vec1, vec2):
    # vec1 and vec2 are lists of (word_id, count) tuples sorted by word_id
    inner_product = 0
    square_length_vec1 = 0
    square_length_vec2 = 0
    for tup1, tup2 in zip(vec1, vec2):
        inner_product += tup1[1]*tup2[1]
        square_length_vec1 += tup1[1]**2
        square_length_vec2 += tup2[1]**2

    # Cosine similarity: dot product divided by the product of the two vector norms
    return (inner_product/sqrt(square_length_vec1*square_length_vec2))

In [12]:
cosine_sim = similarity_with_2_sents(vec1, vec2)
print('Cosine similarity of the two sentences: %.4f.'%cosine_sim)

Cosine similarity of the two sentences: 0.7303.
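
As a check on this value, the cosine similarity can be worked by hand from the counts printed in step 4 (the second element of each tuple). The symbols $\mathbf{a}$ and $\mathbf{b}$ are introduced here for the two count vectors and are not names used in the notebook.

```latex
\mathbf{a} = (2, 2, 1, 1, 0, 1, 1, 0, 0), \qquad
\mathbf{b} = (1, 2, 1, 1, 1, 0, 0, 1, 1)

\mathbf{a} \cdot \mathbf{b} = 2 \cdot 1 + 2 \cdot 2 + 1 \cdot 1 + 1 \cdot 1 = 8, \qquad
\lVert \mathbf{a} \rVert^2 = 12, \qquad
\lVert \mathbf{b} \rVert^2 = 10

\cos(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{8}{\sqrt{12 \cdot 10}}
  = \frac{8}{\sqrt{120}}
  \approx 0.7303
```

Only the four dimensions where both sentences have a nonzero count contribute to the dot product; every other term is zero.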


## Computing sentence similarity with gensim

In [14]:
sent1 = "I love sky, I love sea."
sent2 = "I like running, I love reading."

from nltk import word_tokenize
sents = [sent1, sent2]
texts = [word_tokenize(sent) for sent in sents]
print(texts)

from gensim import corpora
from gensim.similarities import Similarity

# Build the dictionary (vocabulary)
dictionary = corpora.Dictionary(texts)

# Use doc2bow to build the bag-of-words representation
corpus = [dictionary.doc2bow(text) for text in texts]
similarity = Similarity('-Similarity-index', corpus, num_features=len(dictionary))
print(similarity)

# Get the sentence similarity
new_sentence = sent1
test_corpus_1 = dictionary.doc2bow(word_tokenize(new_sentence))

# The query returns one score per indexed document; index 1 is sent2
cosine_sim = similarity[test_corpus_1][1]
print("Sentence similarity computed with gensim: %.4f."%cosine_sim)

[['I', 'love', 'sky', ',', 'I', 'love', 'sea', '.'], ['I', 'like', 'running', ',', 'I', 'love', 'reading', '.']]
Similarity<2 documents in 0 shards stored under -Similarity-index>
Sentence similarity computed with gensim: 0.7303.
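
`doc2bow` returns each document as a sparse list of `(token_id, count)` pairs that contains only the nonzero entries. The sketch below computes the cosine similarity directly on that shape, without gensim, to show why the score matches the manual one. It is an illustration only: `sparse_cosine` is a hypothetical helper, and the IDs are taken from the manual mapping in step 3 rather than from gensim's `Dictionary`, which may number the tokens differently.

```python
from math import sqrt

def sparse_cosine(bow_a, bow_b):
    # bow_a, bow_b: lists of (token_id, count) pairs with zero counts omitted.
    counts_b = dict(bow_b)
    # IDs missing from the other document contribute nothing to the dot product.
    dot = sum(count * counts_b.get(token_id, 0) for token_id, count in bow_a)
    norm_a = sqrt(sum(count ** 2 for _, count in bow_a))
    norm_b = sqrt(sum(count ** 2 for _, count in bow_b))
    return dot / (norm_a * norm_b)

# Nonzero entries of the two sentence vectors printed in step 4.
bow1 = [(0, 2), (1, 2), (2, 1), (3, 1), (5, 1), (6, 1)]
bow2 = [(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (7, 1), (8, 1)]
print(f"{sparse_cosine(bow1, bow2):.4f}")  # 0.7303
```

Dropping the zero entries leaves the dot product and both norms unchanged, which is why the sparse and dense representations give the same score.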
