## References
### Gensim Word2Vec Tutorial – Full Working Example
https://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/#.Xe5O-5MzY1J

In [1]:
# imports needed and set up logging
import gzip
import gensim 
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [2]:
data_file = "reviews_data.txt.gz"

# peek at the first line; in 'rb' mode each line is returned as bytes
with gzip.open(data_file, 'rb') as f:
    for i, line in enumerate(f):
        print(line)
        break

b"Oct 12 2009 \tNice trendy hotel location not too bad.\tI stayed in this hotel for one night. As this is a fairly new place some of the taxi drivers did not know where it was and/or did not want to drive there. Once I have eventually arrived at the hotel, I was very pleasantly surprised with the decor of the lobby/ground floor area. It was very stylish and modern. I found the reception's staff geeting me with 'Aloha' a bit out of place, but I guess they are briefed to say that to keep up the coroporate image.As I have a Starwood Preferred Guest member, I was given a small gift upon-check in. It was only a couple of fridge magnets in a gift box, but nevertheless a nice gesture.My room was nice and roomy, there are tea and coffee facilities in each room and you get two complimentary bottles of water plus some toiletries by 'bliss'.The location is not great. It is at the last metro stop and you then need to take a taxi, but if you are not planning on going to see the historic sites in Be

### Dataset
OpinRank: hotel review data

In [4]:
def read_input(input_file):
    """This method reads the input file which is in gzip format"""
    
    logging.info("reading file {0}...this may take a while".format(input_file))
    
    with gzip.open (input_file, 'rb') as f:
        for i, line in enumerate (f): 

            if (i%10000==0):
                logging.info ("read {0} reviews".format (i))
            # do some pre-processing and return a list of words for each review text
            yield gensim.utils.simple_preprocess (line)

# read the tokenized reviews into a list
# each review item becomes a series of words
# so this becomes a list of lists
documents = list (read_input (data_file))
logging.info ("Done reading data file")

2019-12-09 22:23:47,614 : INFO : reading file reviews_data.txt.gz...this may take a while
2019-12-09 22:23:47,621 : INFO : read 0 reviews
2019-12-09 22:23:50,771 : INFO : read 10000 reviews
2019-12-09 22:23:53,569 : INFO : read 20000 reviews
2019-12-09 22:23:56,722 : INFO : read 30000 reviews
2019-12-09 22:24:00,798 : INFO : read 40000 reviews
2019-12-09 22:24:06,147 : INFO : read 50000 reviews
2019-12-09 22:24:09,153 : INFO : read 60000 reviews
2019-12-09 22:24:12,248 : INFO : read 70000 reviews
2019-12-09 22:24:15,389 : INFO : read 80000 reviews
2019-12-09 22:24:19,040 : INFO : read 90000 reviews
2019-12-09 22:24:21,949 : INFO : read 100000 reviews
2019-12-09 22:24:24,535 : INFO : read 110000 reviews
2019-12-09 22:24:27,032 : INFO : read 120000 reviews
2019-12-09 22:24:30,082 : INFO : read 130000 reviews
2019-12-09 22:24:32,744 : INFO : read 140000 reviews
2019-12-09 22:24:35,228 : INFO : read 150000 reviews
2019-12-09 22:24:38,391 : INFO : read 160000 reviews
2019-12-09 22:24:42,627
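
To make the "list of lists" structure concrete, here is a minimal sketch that streams a small gzip file and tokenizes each line. The sample file contents and the `rough_tokenize` helper are hypothetical. The helper is only a rough stand-in for `gensim.utils.simple_preprocess` (lowercase, letters only, tokens of 2 to 15 characters), not its actual implementation.

```python
import gzip
import os
import re
import tempfile


def rough_tokenize(raw_line):
    """Hypothetical, approximate stand-in for gensim.utils.simple_preprocess."""
    text = raw_line.decode("utf-8", errors="ignore").lower()
    # runs of letters only; digits, underscores and punctuation act as separators
    tokens = re.findall(r"[^\W\d_]+", text)
    # drop very short and very long tokens
    return [t for t in tokens if 2 <= len(t) <= 15]


def read_tokens(path):
    """Yield one token list per line, so the whole file is never read at once."""
    with gzip.open(path, "rb") as f:
        for raw_line in f:
            yield rough_tokenize(raw_line)


# write a tiny made-up sample file so the sketch is self-contained
sample_path = os.path.join(tempfile.mkdtemp(), "sample_reviews.txt.gz")
with gzip.open(sample_path, "wb") as f:
    f.write(b"Oct 12 2009 \tNice trendy hotel\n")
    f.write(b"Great location, a bit noisy\n")

sample_documents = list(read_tokens(sample_path))
print(sample_documents)
```

Each review becomes one inner list of lowercase tokens, which is the shape `Word2Vec` expects for its training sentences.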

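### Training the Word2Vec model
Parameter meanings
 - size
 Dimensionality of each word vector; 100 to 150 usually works well. (Gensim 4.0 and later rename this parameter to vector_size.)
 - window
 Maximum distance between the target word and its neighboring words. Words farther away than this are not treated as context for the target word; the closer a neighbor is, the more related it is assumed to be.
 - min_count
 Minimum frequency: words that occur fewer than min_count times in the corpus are ignored.
 - workers
 Number of worker threads used for training.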
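
The effect of `min_count` and `window` can be illustrated without gensim. The following is a minimal sketch on a made-up toy corpus; `filter_rare` and `context_pairs` are hypothetical helpers, and the sketch ignores details of the real trainer such as frequent-word subsampling.

```python
from collections import Counter

# toy corpus, made up for illustration
toy_documents = [
    ["nice", "clean", "room", "nice", "staff"],
    ["dirty", "room", "rude", "staff"],
]


def filter_rare(documents, min_count):
    """Drop words whose total corpus frequency is below min_count."""
    counts = Counter(word for doc in documents for word in doc)
    return [[w for w in doc if counts[w] >= min_count] for doc in documents]


def context_pairs(doc, window):
    """Return (target, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, target in enumerate(doc):
        start = max(0, i - window)
        stop = min(len(doc), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, doc[j]))
    return pairs


filtered = filter_rare(toy_documents, min_count=2)
print(filtered)
print(context_pairs(filtered[0], window=1))
```

With `min_count=2`, words seen only once ("clean", "dirty", "rude") disappear before any pairs are formed. A larger `window` then produces more (target, context) pairs per word.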

In [5]:
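# the constructor builds the vocabulary and already trains once (5 epochs by default)
model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=2, workers=10)
# this call then trains for 10 more epochs on the same documents
model.train(documents, total_examples=len(documents), epochs=10)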

2019-12-09 22:25:21,705 : INFO : collecting all words and their counts
2019-12-09 22:25:21,707 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-12-09 22:25:22,241 : INFO : PROGRESS: at sentence #10000, processed 1655714 words, keeping 25777 word types
2019-12-09 22:25:22,696 : INFO : PROGRESS: at sentence #20000, processed 3317863 words, keeping 35016 word types
2019-12-09 22:25:23,221 : INFO : PROGRESS: at sentence #30000, processed 5264072 words, keeping 47518 word types
2019-12-09 22:25:23,754 : INFO : PROGRESS: at sentence #40000, processed 7081746 words, keeping 56675 word types
2019-12-09 22:25:24,387 : INFO : PROGRESS: at sentence #50000, processed 9089491 words, keeping 63744 word types
2019-12-09 22:25:25,036 : INFO : PROGRESS: at sentence #60000, processed 11013726 words, keeping 76786 word types
2019-12-09 22:25:25,699 : INFO : PROGRESS: at sentence #70000, processed 12637528 words, keeping 83199 word types
2019-12-09 22:25:26,221 : INFO : PROG

2019-12-09 22:26:38,790 : INFO : EPOCH 1 - PROGRESS: at 62.80% examples, 512285 words/s, in_qsize 20, out_qsize 1
2019-12-09 22:26:39,808 : INFO : EPOCH 1 - PROGRESS: at 65.21% examples, 515552 words/s, in_qsize 17, out_qsize 2
2019-12-09 22:26:40,820 : INFO : EPOCH 1 - PROGRESS: at 67.66% examples, 520997 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:26:41,825 : INFO : EPOCH 1 - PROGRESS: at 69.69% examples, 523346 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:26:42,828 : INFO : EPOCH 1 - PROGRESS: at 71.90% examples, 527372 words/s, in_qsize 18, out_qsize 1
2019-12-09 22:26:43,840 : INFO : EPOCH 1 - PROGRESS: at 73.97% examples, 528029 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:26:44,860 : INFO : EPOCH 1 - PROGRESS: at 75.93% examples, 529885 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:26:45,881 : INFO : EPOCH 1 - PROGRESS: at 78.00% examples, 532719 words/s, in_qsize 20, out_qsize 2
2019-12-09 22:26:46,883 : INFO : EPOCH 1 - PROGRESS: at 80.32% examples, 536491 words/s,

2019-12-09 22:27:43,096 : INFO : EPOCH 2 - PROGRESS: at 87.91% examples, 598261 words/s, in_qsize 20, out_qsize 0
2019-12-09 22:27:44,115 : INFO : EPOCH 2 - PROGRESS: at 90.18% examples, 599149 words/s, in_qsize 19, out_qsize 1
2019-12-09 22:27:45,122 : INFO : EPOCH 2 - PROGRESS: at 92.61% examples, 601046 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:27:46,162 : INFO : EPOCH 2 - PROGRESS: at 94.44% examples, 599315 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:27:47,185 : INFO : EPOCH 2 - PROGRESS: at 95.99% examples, 596229 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:27:48,202 : INFO : EPOCH 2 - PROGRESS: at 97.61% examples, 593431 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:27:49,204 : INFO : EPOCH 2 - PROGRESS: at 99.76% examples, 593798 words/s, in_qsize 9, out_qsize 2
2019-12-09 22:27:49,205 : INFO : worker thread finished; awaiting finish of 9 more threads
2019-12-09 22:27:49,217 : INFO : worker thread finished; awaiting finish of 8 more threads
2019-12-09 22:27:49,2

2019-12-09 22:28:38,985 : INFO : EPOCH 4 - PROGRESS: at 4.19% examples, 422729 words/s, in_qsize 18, out_qsize 1
2019-12-09 22:28:39,998 : INFO : EPOCH 4 - PROGRESS: at 6.74% examples, 510153 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:28:41,029 : INFO : EPOCH 4 - PROGRESS: at 9.05% examples, 553761 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:28:42,113 : INFO : EPOCH 4 - PROGRESS: at 9.91% examples, 511223 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:28:43,113 : INFO : EPOCH 4 - PROGRESS: at 11.08% examples, 498058 words/s, in_qsize 17, out_qsize 2
2019-12-09 22:28:44,127 : INFO : EPOCH 4 - PROGRESS: at 13.22% examples, 530444 words/s, in_qsize 18, out_qsize 1
2019-12-09 22:28:45,138 : INFO : EPOCH 4 - PROGRESS: at 15.30% examples, 550487 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:28:46,154 : INFO : EPOCH 4 - PROGRESS: at 16.88% examples, 550707 words/s, in_qsize 18, out_qsize 1
2019-12-09 22:28:47,177 : INFO : EPOCH 4 - PROGRESS: at 18.63% examples, 557609 words/s, in_

2019-12-09 22:29:44,033 : INFO : EPOCH 5 - PROGRESS: at 44.03% examples, 764160 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:29:45,043 : INFO : EPOCH 5 - PROGRESS: at 46.89% examples, 765930 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:29:46,049 : INFO : EPOCH 5 - PROGRESS: at 49.68% examples, 767393 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:29:47,056 : INFO : EPOCH 5 - PROGRESS: at 52.47% examples, 769596 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:29:48,063 : INFO : EPOCH 5 - PROGRESS: at 55.10% examples, 770471 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:29:49,065 : INFO : EPOCH 5 - PROGRESS: at 57.84% examples, 771144 words/s, in_qsize 19, out_qsize 1
2019-12-09 22:29:50,085 : INFO : EPOCH 5 - PROGRESS: at 60.59% examples, 771291 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:29:51,095 : INFO : EPOCH 5 - PROGRESS: at 63.30% examples, 771105 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:29:52,095 : INFO : EPOCH 5 - PROGRESS: at 66.09% examples, 772144 words/s,

2019-12-09 22:30:42,838 : INFO : worker thread finished; awaiting finish of 7 more threads
2019-12-09 22:30:42,840 : INFO : worker thread finished; awaiting finish of 6 more threads
2019-12-09 22:30:42,841 : INFO : worker thread finished; awaiting finish of 5 more threads
2019-12-09 22:30:42,848 : INFO : worker thread finished; awaiting finish of 4 more threads
2019-12-09 22:30:42,849 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-12-09 22:30:42,861 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-12-09 22:30:42,870 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-12-09 22:30:42,872 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-12-09 22:30:42,873 : INFO : EPOCH - 1 : training on 41519358 raw words (30346396 effective words) took 38.1s, 796127 effective words/s
2019-12-09 22:30:43,901 : INFO : EPOCH 2 - PROGRESS: at 2.52% examples, 777304 words/s, in_qsize 18, out_qsize 1
2019-12-09 22:30:44

2019-12-09 22:31:40,691 : INFO : EPOCH 3 - PROGRESS: at 44.56% examples, 771228 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:31:41,730 : INFO : EPOCH 3 - PROGRESS: at 46.29% examples, 754689 words/s, in_qsize 18, out_qsize 1
2019-12-09 22:31:42,736 : INFO : EPOCH 3 - PROGRESS: at 48.95% examples, 755287 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:31:43,738 : INFO : EPOCH 3 - PROGRESS: at 51.29% examples, 751852 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:31:44,750 : INFO : EPOCH 3 - PROGRESS: at 53.45% examples, 748000 words/s, in_qsize 20, out_qsize 2
2019-12-09 22:31:45,750 : INFO : EPOCH 3 - PROGRESS: at 55.78% examples, 743817 words/s, in_qsize 16, out_qsize 3
2019-12-09 22:31:46,932 : INFO : EPOCH 3 - PROGRESS: at 57.35% examples, 726223 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:31:47,947 : INFO : EPOCH 3 - PROGRESS: at 58.73% examples, 713415 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:31:48,949 : INFO : EPOCH 3 - PROGRESS: at 60.32% examples, 703923 words/s,

2019-12-09 22:32:45,714 : INFO : EPOCH 4 - PROGRESS: at 63.59% examples, 526971 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:32:46,718 : INFO : EPOCH 4 - PROGRESS: at 65.51% examples, 527256 words/s, in_qsize 18, out_qsize 1
2019-12-09 22:32:47,844 : INFO : EPOCH 4 - PROGRESS: at 66.61% examples, 520875 words/s, in_qsize 15, out_qsize 4
2019-12-09 22:32:48,920 : INFO : EPOCH 4 - PROGRESS: at 68.31% examples, 519278 words/s, in_qsize 18, out_qsize 1
2019-12-09 22:32:50,043 : INFO : EPOCH 4 - PROGRESS: at 69.77% examples, 516121 words/s, in_qsize 19, out_qsize 2
2019-12-09 22:32:51,046 : INFO : EPOCH 4 - PROGRESS: at 70.86% examples, 512097 words/s, in_qsize 19, out_qsize 2
2019-12-09 22:32:52,190 : INFO : EPOCH 4 - PROGRESS: at 72.02% examples, 506594 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:32:53,206 : INFO : EPOCH 4 - PROGRESS: at 73.13% examples, 501836 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:32:54,406 : INFO : EPOCH 4 - PROGRESS: at 74.30% examples, 496063 words/s,

2019-12-09 22:33:51,192 : INFO : EPOCH 5 - PROGRESS: at 77.90% examples, 601529 words/s, in_qsize 20, out_qsize 0
2019-12-09 22:33:52,200 : INFO : EPOCH 5 - PROGRESS: at 79.41% examples, 597947 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:33:53,234 : INFO : EPOCH 5 - PROGRESS: at 80.75% examples, 593297 words/s, in_qsize 19, out_qsize 1
2019-12-09 22:33:54,237 : INFO : EPOCH 5 - PROGRESS: at 82.16% examples, 589125 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:33:55,264 : INFO : EPOCH 5 - PROGRESS: at 83.55% examples, 584853 words/s, in_qsize 18, out_qsize 1
2019-12-09 22:33:56,296 : INFO : EPOCH 5 - PROGRESS: at 85.67% examples, 586297 words/s, in_qsize 20, out_qsize 0
2019-12-09 22:33:57,292 : INFO : EPOCH 5 - PROGRESS: at 87.40% examples, 584092 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:33:58,306 : INFO : EPOCH 5 - PROGRESS: at 89.90% examples, 586413 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:33:59,322 : INFO : EPOCH 5 - PROGRESS: at 92.37% examples, 588608 words/s,

2019-12-09 22:34:55,798 : INFO : worker thread finished; awaiting finish of 9 more threads
2019-12-09 22:34:55,811 : INFO : worker thread finished; awaiting finish of 8 more threads
2019-12-09 22:34:55,834 : INFO : worker thread finished; awaiting finish of 7 more threads
2019-12-09 22:34:55,836 : INFO : worker thread finished; awaiting finish of 6 more threads
2019-12-09 22:34:55,870 : INFO : worker thread finished; awaiting finish of 5 more threads
2019-12-09 22:34:55,942 : INFO : worker thread finished; awaiting finish of 4 more threads
2019-12-09 22:34:55,970 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-12-09 22:34:56,003 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-12-09 22:34:56,008 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-12-09 22:34:56,037 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-12-09 22:34:56,038 : INFO : EPOCH - 6 : training on 41519358 raw words (30351603 effe

2019-12-09 22:35:52,483 : INFO : EPOCH 8 - PROGRESS: at 23.20% examples, 708304 words/s, in_qsize 19, out_qsize 1
2019-12-09 22:35:53,485 : INFO : EPOCH 8 - PROGRESS: at 25.32% examples, 710208 words/s, in_qsize 18, out_qsize 1
2019-12-09 22:35:54,489 : INFO : EPOCH 8 - PROGRESS: at 28.01% examples, 710894 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:35:55,490 : INFO : EPOCH 8 - PROGRESS: at 30.55% examples, 712121 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:35:56,516 : INFO : EPOCH 8 - PROGRESS: at 33.25% examples, 712385 words/s, in_qsize 20, out_qsize 3
2019-12-09 22:35:57,528 : INFO : EPOCH 8 - PROGRESS: at 35.70% examples, 712569 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:35:58,530 : INFO : EPOCH 8 - PROGRESS: at 38.44% examples, 715754 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:35:59,536 : INFO : EPOCH 8 - PROGRESS: at 40.81% examples, 713470 words/s, in_qsize 20, out_qsize 1
2019-12-09 22:36:00,554 : INFO : EPOCH 8 - PROGRESS: at 43.59% examples, 715043 words/s,

2019-12-09 22:36:56,769 : INFO : EPOCH 9 - PROGRESS: at 74.04% examples, 696845 words/s, in_qsize 20, out_qsize 0
2019-12-09 22:36:57,772 : INFO : EPOCH 9 - PROGRESS: at 76.32% examples, 697921 words/s, in_qsize 18, out_qsize 1
2019-12-09 22:36:58,772 : INFO : EPOCH 9 - PROGRESS: at 78.63% examples, 698540 words/s, in_qsize 18, out_qsize 1
2019-12-09 22:36:59,775 : INFO : EPOCH 9 - PROGRESS: at 80.90% examples, 698692 words/s, in_qsize 20, out_qsize 1
2019-12-09 22:37:00,788 : INFO : EPOCH 9 - PROGRESS: at 83.30% examples, 698833 words/s, in_qsize 20, out_qsize 0
2019-12-09 22:37:01,796 : INFO : EPOCH 9 - PROGRESS: at 85.50% examples, 698633 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:37:02,811 : INFO : EPOCH 9 - PROGRESS: at 88.12% examples, 699125 words/s, in_qsize 18, out_qsize 1
2019-12-09 22:37:03,838 : INFO : EPOCH 9 - PROGRESS: at 90.71% examples, 699460 words/s, in_qsize 19, out_qsize 0
2019-12-09 22:37:04,845 : INFO : EPOCH 9 - PROGRESS: at 93.02% examples, 698967 words/s,

(303497830, 415193580)
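
The tuple printed above appears to be the return value of `model.train`: the number of effective words trained on and the number of raw words seen, summed over all 10 epochs. That reading is an assumption, but it is consistent with the per-epoch figure in the log (41519358 raw words), as this short check shows:

```python
# values copied from the cell output and the training log
effective_total, raw_total = 303497830, 415193580
epochs = 10
raw_per_epoch = 41519358  # "training on 41519358 raw words" in the log

# the raw total equals the per-epoch raw count times the number of epochs
totals_match = raw_per_epoch * epochs == raw_total
print(totals_match)

# share of raw words that actually contributed to training
kept_ratio = effective_total / raw_total
print(f"{kept_ratio:.3f}")
```

Roughly 73% of the raw words were used. The rest were presumably dropped as rare words (below `min_count`) or by gensim's downsampling of very frequent words.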

In [8]:
# find the words most similar to w1
w1 = "happy"
model.wv.most_similar (positive=w1)

[('pleased', 0.8111482262611389),
 ('satisfied', 0.7393232583999634),
 ('delighted', 0.6575272083282471),
 ('thrilled', 0.6547014713287354),
 ('impressed', 0.6537867784500122),
 ('disappointed', 0.5851640105247498),
 ('dissapointed', 0.560068666934967),
 ('grateful', 0.5515925884246826),
 ('willing', 0.5409046411514282),
 ('displeased', 0.5393090844154358)]

In [13]:
# topn limits the number of results returned
w1 = ["sad"]
model.wv.most_similar (positive=w1,topn=6)

[('upset', 0.5102924108505249),
 ('curious', 0.5052732825279236),
 ('unhappy', 0.4968779981136322),
 ('upsetting', 0.48018184304237366),
 ('shocked', 0.4725891351699829),
 ('embarrassed', 0.46805045008659363)]

In [16]:
w1 = ["beijing"]
model.wv.most_similar (positive=w1,topn=6)

[('shanghai', 0.8714462518692017),
 ('dubai', 0.8569789528846741),
 ('sf', 0.8416804671287537),
 ('chicago', 0.8219022154808044),
 ('london', 0.8056219220161438),
 ('montreal', 0.787434995174408)]

In [29]:
# get everything related to stuff on the bed
w1 = ["bed",'sheet','pillow']
w2 = ['couch']
model.wv.most_similar (positive=w1,negative=w2,topn=10)

[('duvet', 0.6999406218528748),
 ('blanket', 0.6942123174667358),
 ('mattress', 0.6912200450897217),
 ('quilt', 0.6899371147155762),
 ('pillowcase', 0.6777721643447876),
 ('matress', 0.6653842329978943),
 ('foam', 0.6418854594230652),
 ('pillowcases', 0.6369457840919495),
 ('pillows', 0.6335356831550598),
 ('sheets', 0.6127753853797913)]
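
A query with both `positive` and `negative` words can be pictured as follows: normalize each input vector, average them with a minus sign on the negative words, and rank the remaining vocabulary by cosine similarity to that combined vector. The sketch below follows that idea on made-up 2-dimensional vectors; `toy_most_similar` is a hypothetical helper, not gensim's implementation.

```python
import math

# made-up 2-dimensional vectors; the model above uses 150 dimensions
toy_vectors = {
    "bed": [0.9, 0.1],
    "pillow": [0.8, 0.3],
    "couch": [0.2, 0.9],
    "duvet": [0.85, 0.15],
    "sofa": [0.1, 0.95],
}


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def toy_most_similar(vectors, positive, negative, topn):
    """Rank words by cosine similarity to the signed mean of the unit-length inputs."""
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word, sign in signed:
        for k, x in enumerate(unit(vectors[word])):
            query[k] += sign * x / len(signed)
    query = unit(query)

    scores = []
    for word, vec in vectors.items():
        if word in positive or word in negative:
            continue  # the query words themselves are excluded from the results
        scores.append((word, sum(a * b for a, b in zip(unit(vec), query))))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


ranking = toy_most_similar(toy_vectors, ["bed", "pillow"], ["couch"], topn=2)
print(ranking)
```

In this toy setup "duvet" ranks above "sofa" because subtracting "couch" pushes the query away from seating furniture and toward bedding.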

In [32]:
# similarity between two words
model.wv.similarity(w1="dirty",w2="smelly")

0.756151

In [34]:
# similarity between two identical words
model.wv.similarity(w1="dirty",w2="dirty")

1.0

In [35]:
# similarity between two words with opposite meanings
model.wv.similarity(w1="dirty",w2="clean")

0.27739334
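
The scores returned by `similarity` are cosine similarities between the two word vectors. A minimal sketch with made-up vectors (the names `a`, `b`, and `c` are placeholders, not model output):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, different length
c = [3.0, 0.0, -1.0]  # orthogonal to a

print(cosine_similarity(a, a))
print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

A vector compared with itself scores 1.0 (up to floating-point rounding), which matches the dirty/dirty result above. Only direction matters, so `a` and `b` also score 1.0, while orthogonal vectors score 0.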

In [37]:
model.wv.doesnt_match(["cat","dog","france"])

'france'

In [39]:
# Which one is the odd one out in this list?
model.wv.doesnt_match(["bed","pillow","duvet","shower"])

'shower'
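
One way to picture the odd-one-out query: average the unit-length vectors of all the words, then return the word whose vector is least similar to that average. The sketch below does this on made-up 2-dimensional vectors; `toy_doesnt_match` is a hypothetical helper, not gensim's implementation.

```python
import math

# made-up vectors: three bedding words point one way, "shower" another
toy_vectors = {
    "bed": [0.9, 0.1],
    "pillow": [0.8, 0.3],
    "duvet": [0.85, 0.15],
    "shower": [0.1, 0.9],
}


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def toy_doesnt_match(vectors, words):
    """Return the word whose vector is least similar to the group mean."""
    units = [unit(vectors[w]) for w in words]
    dim = len(units[0])
    mean = unit([sum(u[k] for u in units) / len(units) for k in range(dim)])
    scores = [sum(a * b for a, b in zip(u, mean)) for u in units]
    return words[scores.index(min(scores))]


odd_one = toy_doesnt_match(toy_vectors, ["bed", "pillow", "duvet", "shower"])
print(odd_one)
```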