## 1.3 Gensim

Gensim is an open-source Python library for representing documents as semantic vectors.

Official site: https://radimrehurek.com/gensim/

- Word2vec
- FastText
- TF-IDF, LSA, LDA

Question: why represent words as vectors?
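
One way to approach this question: once words are vectors, "similar meaning" becomes a number that can be computed. The sketch below uses cosine similarity, the measure that Gensim's `most_similar` reports. The 3-dimensional vectors are invented for illustration and are not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, made up for this example only
toy_vectors = {
    'cat': [0.9, 0.8, 0.1],
    'dog': [0.8, 0.9, 0.2],
    'car': [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_vectors['cat'], toy_vectors['dog'])
sim_cat_car = cosine_similarity(toy_vectors['cat'], toy_vectors['car'])
print(f'cat vs dog: {sim_cat_dog:.3f}')
print(f'cat vs car: {sim_cat_car:.3f}')
```

With these toy values, 'cat' scores much closer to 'dog' than to 'car', which is the kind of comparison a string representation cannot offer.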


In [3]:
import gensim
import gensim.downloader as api

list(api.info()['models'].keys())

['fasttext-wiki-news-subwords-300',
 'conceptnet-numberbatch-17-06-300',
 'word2vec-ruscorpora-300',
 'word2vec-google-news-300',
 'glove-wiki-gigaword-50',
 'glove-wiki-gigaword-100',
 'glove-wiki-gigaword-200',
 'glove-wiki-gigaword-300',
 'glove-twitter-25',
 'glove-twitter-50',
 'glove-twitter-100',
 'glove-twitter-200',
 '__testing_word2vec-matrix-synopsis']

In [2]:
glove_vectors = api.load('glove-twitter-25')

In [4]:
# Find the words most similar to 'twitter'
glove_vectors.most_similar('twitter')

[('facebook', 0.948005199432373),
 ('tweet', 0.9403423070907593),
 ('fb', 0.9342358708381653),
 ('instagram', 0.9104824066162109),
 ('chat', 0.8964964747428894),
 ('hashtag', 0.8885937333106995),
 ('tweets', 0.8878158330917358),
 ('tl', 0.8778461217880249),
 ('link', 0.8778210878372192),
 ('internet', 0.8753897547721863)]

In [5]:
# Look up the word vector for 'computer'
glove_vectors['computer']

array([ 0.64005 , -0.019514,  0.70148 , -0.66123 ,  1.1723  , -0.58859 ,
        0.25917 , -0.81541 ,  1.1708  ,  1.1413  , -0.15405 , -0.11369 ,
       -3.8414  , -0.87233 ,  0.47489 ,  1.1541  ,  0.97678 ,  1.1107  ,
       -0.14572 , -0.52013 , -0.52234 , -0.92349 ,  0.34651 ,  0.061939,
       -0.57375 ], dtype=float32)

### Sentiment classification with pretrained word vectors

In [6]:
from nltk.corpus import movie_reviews
import random
random.seed(42)


def load_movie_reviews():
    pos_ids = movie_reviews.fileids('pos')
    neg_ids = movie_reviews.fileids('neg')

    all_reviews = []
    for pids in pos_ids:
        all_reviews.append((movie_reviews.raw(pids), 'positive'))
    
    for nids in neg_ids:
        all_reviews.append((movie_reviews.raw(nids), 'negative'))

    random.shuffle(all_reviews)
    train_reviews = all_reviews[:1600]
    test_reviews = all_reviews[1600:]

    return train_reviews, test_reviews

train_reviews, test_reviews = load_movie_reviews()
print('train:', len(train_reviews))
print('test:', len(test_reviews))

train: 1600
test: 400


In [7]:
from nltk import word_tokenize
import numpy as np

# Represent a text as the average of the word vectors of its in-vocabulary tokens
def convert_text_to_vector(text, vectors):
    vector = np.zeros(vectors.vector_size)
    num = 0
    for word in word_tokenize(text):
        if word in vectors:
            vector = vector + vectors[word]
            num += 1
    if num > 0:
        vector = vector / num
    return vector
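
To see what the averaging does, including the case where no token is in the vocabulary, here is a simplified sketch without external dependencies. It assumes whitespace tokenization instead of `word_tokenize` and a plain dict of invented 2-dimensional vectors instead of the GloVe model. The name `average_vector` is hypothetical.

```python
def average_vector(text, vectors, size):
    """Mean of the vectors of in-vocabulary tokens; all zeros if none match."""
    total = [0.0] * size
    count = 0
    for word in text.split():  # assumption: whitespace tokenization
        if word in vectors:
            total = [t + v for t, v in zip(total, vectors[word])]
            count += 1
    if count > 0:
        total = [t / count for t in total]
    return total


# Toy vectors, made up for this example only
toy_vectors = {'good': [1.0, 0.0], 'movie': [0.0, 1.0]}

doc_vec = average_vector('good movie indeed', toy_vectors, size=2)
empty_vec = average_vector('zzz qqq', toy_vectors, size=2)
print(doc_vec)    # 'indeed' is skipped, so two vectors are averaged
print(empty_vec)  # no known tokens: the zero vector is returned
```

Note that averaging discards word order, and a text with no known tokens maps to the zero vector, which the classifier still has to label.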

In [8]:
def build_X_y(reviews, vectors):
    X = []
    Y = []
    
    for review, polarity in reviews:
        x = convert_text_to_vector(review, vectors)
        y = 0 if polarity == 'negative' else 1
        X.append(x)
        Y.append(y)

    return X, Y


In [9]:
X_train, y_train = build_X_y(train_reviews, glove_vectors)
X_test, y_test = build_X_y(test_reviews, glove_vectors)

In [10]:
from sklearn.svm import LinearSVC


def train_and_test(X_train, y_train, X_test, y_test):
    classifier = LinearSVC()

    classifier.fit(X_train, y_train)
    accuracy = classifier.score(X_test, y_test)
    print(f'accuracy is {accuracy:.4f}')

    return classifier

In [11]:
train_and_test(X_train, y_train, X_test, y_test)

accuracy is 0.7200




In [1]:
from gensim.scripts.glove2word2vec import glove2word2vec
glove2word2vec(glove_input_file=".\\data\\glove.twitter.27B.50d.txt", word2vec_output_file=".\\data\\gensim_glove_vectors.txt")


(1193514, 50)
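
The `(1193514, 50)` returned above is the number of vectors and their dimensionality. The main difference between the two text formats is that a word2vec file begins with a header line holding exactly those two numbers, while a GloVe file has no header. A minimal sketch of that conversion on in-memory lines follows; the sample lines are invented, `to_word2vec_lines` is a hypothetical helper, and this is not Gensim's implementation.

```python
def to_word2vec_lines(glove_lines):
    """Prepend the '<count> <dim>' header that the word2vec text format expects."""
    rows = [line.strip() for line in glove_lines if line.strip()]
    dim = len(rows[0].split(' ')) - 1  # the first field is the token itself
    header = f'{len(rows)} {dim}'
    return [header] + rows


# Invented sample in GloVe text format: token followed by its components
sample = [
    'hello 0.1 0.2 0.3',
    'world 0.4 0.5 0.6',
]

converted = to_word2vec_lines(sample)
print(converted[0])
```

In Gensim 4.0 and later, `KeyedVectors.load_word2vec_format(path, binary=False, no_header=True)` can read a GloVe text file directly, so the conversion step is optional there.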

In [4]:
from gensim.models import KeyedVectors
# Load the 50-dimensional pretrained word vectors
glove_vectors_50 = KeyedVectors.load_word2vec_format('.\\data\\gensim_glove_vectors.txt', binary=False)

In [5]:
X_train, y_train = build_X_y(train_reviews, glove_vectors_50)
X_test, y_test = build_X_y(test_reviews, glove_vectors_50)

In [12]:
train_and_test(X_train, y_train, X_test, y_test)

accuracy is 0.7275


LinearSVC()

Question: how could this be improved further?
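
One possible direction, offered as a sketch and not as a measured improvement: weight each word by its inverse document frequency (IDF) before averaging, so that words appearing in almost every review contribute less. The example below is dependency-free; the documents, vectors, and function names are all invented for illustration, and the smoothed IDF formula is one common choice among several.

```python
import math
from collections import Counter


def idf_weights(tokenized_docs):
    """Smoothed inverse document frequency for each token."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each token once per document
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0
            for w, df in doc_freq.items()}


def weighted_average(tokens, vectors, weights, size):
    """IDF-weighted mean of the vectors of known tokens; zeros if none match."""
    total = [0.0] * size
    weight_sum = 0.0
    for word in tokens:
        if word in vectors and word in weights:
            w = weights[word]
            total = [t + w * v for t, v in zip(total, vectors[word])]
            weight_sum += w
    if weight_sum > 0:
        total = [t / weight_sum for t in total]
    return total


# Toy corpus and vectors, made up for this example only
docs = [
    ['the', 'movie', 'was', 'great'],
    ['the', 'movie', 'was', 'dull'],
    ['the', 'plot', 'was', 'great'],
]
toy_vectors = {'the': [0.0, 1.0], 'dull': [-1.0, 0.0]}

weights = idf_weights(docs)
vec = weighted_average(['the', 'dull'], toy_vectors, weights, size=2)
print(weights['the'], weights['dull'])
print(vec)
```

Here the rare word 'dull' pulls the result further toward its own vector than a plain average would. Whether this helps on the movie review data would need to be checked on the test set.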